<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: velprove</title>
    <description>The latest articles on DEV Community by velprove (@velprove).</description>
    <link>https://dev.to/velprove</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847633%2F11c1a231-027c-4c79-89cc-c40cd77f4834.png</url>
      <title>DEV Community: velprove</title>
      <link>https://dev.to/velprove</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/velprove"/>
    <language>en</language>
    <item>
      <title>Vercel FRA1 CDN Failures, June 2026: Monitoring a Vercel App Across Regions</title>
      <dc:creator>velprove</dc:creator>
      <pubDate>Tue, 16 Jun 2026 14:00:02 +0000</pubDate>
      <link>https://dev.to/velprove/vercel-fra1-cdn-failures-june-2026-monitoring-a-vercel-app-across-regions-1bd5</link>
      <guid>https://dev.to/velprove/vercel-fra1-cdn-failures-june-2026-monitoring-a-vercel-app-across-regions-1bd5</guid>
      <description>&lt;p&gt;&lt;strong&gt;The blind spot:&lt;/strong&gt; the Vercel FRA1 outage in June 2026 was a partial degradation of one region, not a platform-wide outage: on June 14 2026, Vercel's FRA1 CDN in Frankfurt hit elevated latency and errors, and Vercel rerouted that traffic to CDG1 in Paris until Frankfurt recovered. If your one monitoring vantage never routes through Frankfurt, you see 100 percent green while Frankfurt-served users get errors, so you do not even know there is an incident. The way to monitor a Vercel app across regions is coverage: run a monitor from each region your users come from, against the same production URL, so a one-region failure lands on a monitor that was actually sampling there. Velprove gives you 5 probe regions on every plan, and its free, no-code browser login monitor walks the real sign-in path from the region you choose. To be clear up front, Velprove did not monitor this incident and we make no claim that we detected it or would have; we use its shape to show why running monitors from more than one region is what surfaces a single-region failure at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  What broke in FRA1 on June 14, 2026 (and what didn't)
&lt;/h2&gt;

&lt;p&gt;What was affected here was narrow, and the narrowness is the whole point. Vercel named exactly one thing: the CDN in the FRA1 region, Frankfurt, Germany. The verbatim incident title on Vercel's status page is "Elevated latency and errors in FRA1 Vercel Region (Frankfurt, Germany)." Not Functions. Not Serverless or Builds. Not the Edge. Not any other Vercel region. One CDN region, classified Major, a partial degradation of elevated latency and errors rather than a full stoppage. Requests through Frankfurt were slow and some errored; the platform as a whole kept running.&lt;/p&gt;

&lt;p&gt;What did not break matters just as much, because it is what makes this kind of failure easy to miss. A Vercel-hosted app serving users in North America, the UK, or Asia saw nothing. Their traffic never touched FRA1. The status code on their requests stayed clean the entire window. The blast radius was one region's CDN, and unless your users or your monitors actually route through Frankfurt, the incident is invisible to you by construction, not by accident.&lt;/p&gt;

&lt;p&gt;One disambiguation, because June 2026 had more than one Vercel incident. This is not the June 9 build-error incident, and it is not the June 8 DUB1 Functions-error incident. Different components, different regions, different days. They are named here only to keep them apart; do not conflate either with this FRA1 CDN event.&lt;/p&gt;

&lt;h2&gt;
  
  
  The FRA1 timeline (UTC, from Vercel's status page)
&lt;/h2&gt;

&lt;p&gt;The timeline below is taken from Vercel's status page incident permalink. All times UTC. Source: &lt;a href="https://www.vercel-status.com/incidents/2f4kf430rmlz" rel="noopener noreferrer"&gt;Vercel Status incident 2f4kf430rmlz&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time (UTC)&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Update&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Jun 14, 21:03&lt;/td&gt;
&lt;td&gt;Investigating&lt;/td&gt;
&lt;td&gt;"We are currently re-routing traffic from the FRA1 Vercel Region to the CDG1 Vercel Region. We are currently investigating this issue."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 14, 21:07&lt;/td&gt;
&lt;td&gt;Identified&lt;/td&gt;
&lt;td&gt;"We are currently re-routing traffic from the FRA1 Vercel Region to the CDG1 Vercel Region."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 14, 21:26&lt;/td&gt;
&lt;td&gt;Update&lt;/td&gt;
&lt;td&gt;"The impact is currently mitigated."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 14, 23:15&lt;/td&gt;
&lt;td&gt;Update&lt;/td&gt;
&lt;td&gt;"The issue has been fixed. We are starting to re-route traffic from the CDG1 Vercel Region back to the FRA1 Vercel Region."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 15, 00:25&lt;/td&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;"All traffic has been restored to the FRA1 Vercel Region."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 15, 01:24&lt;/td&gt;
&lt;td&gt;Resolved&lt;/td&gt;
&lt;td&gt;"This incident has been resolved."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two bookend timestamps run from 2026-06-14 21:03 UTC to 2026-06-15 01:24 UTC, which works out to approximately 4 hours 20 minutes. That duration figure is our derivation from the two published timestamps, not a number Vercel stated. Read it carefully, because elapsed time and user-facing impact are not the same thing. Vercel reported the impact mitigated at 21:26 UTC, 23 minutes after the incident opened. Most of the remaining window was Vercel cautiously rerouting traffic from Paris back to Frankfurt: the fix landed at 23:15 UTC, all traffic was restored to FRA1 at 00:25 UTC the next day, and the incident was formally resolved at 01:24 UTC. So, going by Vercel's own updates, the sharp user-facing pain was concentrated in those first ~23 minutes; the long tail was a careful, staged return to the home region, not four-plus hours of continuous errors.&lt;/p&gt;
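
&lt;p&gt;The derivation is simple enough to check by hand, but here it is as a short Python sketch using only the standard library. It takes three timestamps from the timeline table above and splits the window into "opened to mitigation" and "mitigation to resolved"; the helper and variable names are ours, not anything Vercel publishes.&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
from datetime import datetime, timezone


def utc(stamp):
    """Parse a 'YYYY-MM-DD HH:MM' string from the status timeline as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def minutes_between(start, end):
    """Whole minutes elapsed between two timezone-aware datetimes."""
    return int((end - start).total_seconds() // 60)


# Timestamps copied from Vercel's status timeline (UTC).
opened = utc("2026-06-14 21:03")     # Investigating
mitigated = utc("2026-06-14 21:26")  # "The impact is currently mitigated."
resolved = utc("2026-06-15 01:24")   # Resolved

total = minutes_between(opened, resolved)
to_mitigation = minutes_between(opened, mitigated)
tail = minutes_between(mitigated, resolved)

print(f"total: {total // 60}h {total % 60}m")
print(f"opened to mitigation: {to_mitigation}m")
print(f"mitigation to resolved: {tail // 60}h {tail % 60}m")
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Running it prints a 4h 21m total (hence "approximately 4 hours 20 minutes"), 23 minutes from opening to mitigation, and a 3h 58m tail. Most of the elapsed time sits after Vercel reported the impact mitigated.&lt;/p&gt;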

&lt;p&gt;Vercel did not publish a root cause. There is no postmortem at the incident URL as of writing, which is normal for a single-region partial degradation. We do not speculate on why FRA1 degraded. No AWS region, no capacity story, no bad deploy, no traffic spike. The record says elevated latency and errors, mitigated by rerouting; we quote that and stop. This is also effectively a single source, Vercel's own status page, with no third-party corroboration, which is stated here so the post claims no more than the record supports.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Vercel reroutes a failing CDN region (and why it can still hit users)
&lt;/h2&gt;

&lt;p&gt;Vercel's CDN is built as a wide front edge over a smaller compute footprint: &lt;a href="https://vercel.com/docs/regions" rel="noopener noreferrer"&gt;over 126 Points of Presence in front of 20 compute-capable regions&lt;/a&gt;. FRA1 is Frankfurt, Germany; CDG1 is Paris, France, the next-closest region. Vercel's documented behavior is exactly what played out here: "In the event of regional downtime, application traffic is automatically rerouted to the next closest region." When Frankfurt degraded, Vercel rerouted that region's CDN traffic to Paris, then routed it back to Frankfurt once the region recovered. Vercel said as much in its own updates: "We are currently re-routing traffic from the FRA1 Vercel Region to the CDG1 Vercel Region," and later, "The issue has been fixed. We are starting to re-route traffic from the CDG1 Vercel Region back to the FRA1 Vercel Region."&lt;/p&gt;
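
&lt;p&gt;To make "next closest region" concrete, here is a toy Python model of the idea. It is not Vercel's implementation. It assumes closeness means great-circle distance between approximate city coordinates, and it uses a small, illustrative subset of European region codes. Vercel's real traffic steering may weigh latency, capacity, or other signals we have no visibility into.&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import math

# Illustrative subset of European regions with approximate city coordinates
# (lat, lon). Not an official or complete list; distance is our assumption.
REGIONS = {
    "fra1": (50.11, 8.68),    # Frankfurt
    "cdg1": (48.86, 2.35),    # Paris
    "lhr1": (51.51, -0.13),   # London
    "dub1": (53.35, -6.26),   # Dublin
    "arn1": (59.33, 18.07),   # Stockholm
}


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def reroute_target(failing, unhealthy=frozenset()):
    """Pick the geographically closest healthy region to a failing one.

    Toy model only: real CDN steering is not necessarily distance-based.
    """
    origin = REGIONS[failing]
    candidates = [r for r in REGIONS if r != failing and r not in unhealthy]
    return min(candidates, key=lambda r: haversine_km(origin, REGIONS[r]))


print(reroute_target("fra1"))            # cdg1
print(reroute_target("fra1", {"cdg1"}))  # lhr1
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Under this distance-only assumption, Frankfurt's nearest neighbor is Paris (roughly 480 km away), which matches the FRA1-to-CDG1 reroute in the timeline. If Paris were also marked unhealthy, the model would fall back to London.&lt;/p&gt;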

&lt;p&gt;One accuracy point is easy to get wrong, so state it precisely. This CDN-region reroute is platform behavior that Vercel operates for everyone. It is not the same thing as automatic Vercel Function failover across regions, which is an Enterprise-only feature. What happened here is Vercel, as the platform operator, steering CDN traffic away from a sick region and back again. Do not read it as "every Vercel customer gets automatic multi-region failover." They do not.&lt;/p&gt;

&lt;p&gt;And here is the bridge to why any of this matters for monitoring. Rerouting is not instant and it is not free of impact. There is a real window, before and around mitigation, where requests pathing through Frankfurt saw elevated latency and errors. The platform absorbed it well and recovered, but "the platform handled it" and "no user felt it" are different claims. Users hitting FRA1 during that window experienced the degradation. If those were your users, the incident was real for your product, whatever the rest of the world saw.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a single monitoring vantage misses a one-region failure
&lt;/h2&gt;

&lt;p&gt;Here is the failure shape this post is built around, and it differs from the two shapes covered in earlier teardowns. A single-region CDN degradation is invisible to a monitor that never routes through that region. If your one monitoring vantage runs from, say, North America, and FRA1 degrades in Frankfurt, your monitor reports a clean 100 percent green the entire time. Not because the monitor is misconfigured, but because nothing your monitor touched was broken. The errors were happening to a population you were not watching. You do not have a false green; you have no signal at all, because you never sampled the place that failed.&lt;/p&gt;

&lt;p&gt;Be precise about what this is and is not. This is not a 200-but-slow problem. In a 200-but-slow degradation, your monitor does hit the affected path and gets a successful status code that hides a real slowdown; the catch there is a response-time threshold, and that is the subject of a separate teardown. Here the issue is upstream of thresholds entirely: your monitor never even saw the failing region, so there is nothing for any assertion to catch. It is also not the differential-localization story, where you run probes from several regions to work out which region is slow. That assumes you already know there is an incident and are localizing it. This is a step before that: without coverage of the failing region, you do not know there is an incident to localize.&lt;/p&gt;

&lt;p&gt;So the owned lesson here is coverage, not localization. And be exact about the mechanism, because it is easy to overstate. One Velprove monitor watches one region you pick. It does not watch all regions at once. Coverage means running monitors from more than one vantage, so that the regions your users actually come from are each being sampled by something. If Frankfurt is a place your users live, something of yours has to be watching from there, or the next FRA1-shaped incident is a green dashboard and a support ticket arriving at the same time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring a Vercel-hosted app across regions with Velprove
&lt;/h2&gt;

&lt;p&gt;The primitive is a Velprove HTTP monitor pointed at your real production URL, with the probe origin set to a region you care about. Velprove offers 5 probe regions on every plan: North America, Europe, United Kingdom, Asia, and Oceania. Each monitor runs from one of them. To cover more than one vantage, you create more than one monitor, one per region, against the same URL. That is what turns a single blind spot into actual geographic coverage: if a Europe-origin monitor is watching while an FRA1-shaped degradation hits, that monitor is the one that goes red, while your North America monitor, correctly, stays green because nothing it touched broke.&lt;/p&gt;
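
&lt;p&gt;The coverage argument fits in a few lines of Python. This is a toy model, not Velprove's API. The region labels are hypothetical identifiers. The model also assumes that a Europe-origin monitor's requests route through the degraded edge region, which in practice depends on how the CDN steers that vantage's traffic.&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
# Toy model of monitoring coverage: each monitor samples exactly one region.
# Region labels and the failing set are hypothetical inputs for illustration.


def monitor_states(monitor_regions, failing_regions):
    """Return {region: "red" or "green"} for each monitor.

    A monitor only goes red if the region it samples is actually failing;
    it has no visibility into any other region.
    """
    return {
        region: "red" if region in failing_regions else "green"
        for region in monitor_regions
    }


def incident_visible(monitor_regions, failing_regions):
    """True if at least one monitor samples a failing region."""
    return any(r in failing_regions for r in monitor_regions)


failing = {"europe"}  # an FRA1-shaped, single-region degradation

single_vantage = ["north-america"]
multi_vantage = ["north-america", "europe", "asia"]

print(monitor_states(single_vantage, failing))    # {'north-america': 'green'}
print(incident_visible(single_vantage, failing))  # False: no signal at all
print(monitor_states(multi_vantage, failing))
# {'north-america': 'green', 'europe': 'red', 'asia': 'green'}
print(incident_visible(multi_vantage, failing))   # True
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With a single North America vantage, the model reports no visibility at all. Add Europe and Asia monitors against the same URL, and the same failure turns exactly one monitor red while the others correctly stay green.&lt;/p&gt;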

&lt;p&gt;A light, useful detail on the HTTP side: Vercel sets an &lt;code&gt;x-vercel-id&lt;/code&gt; header on responses, shaped roughly as &lt;code&gt;&amp;lt;pop&amp;gt;::&amp;lt;region&amp;gt;::&amp;lt;hash&amp;gt;&lt;/code&gt;. Its presence confirms the response came back through a Vercel edge node rather than a stale cached error page from somewhere upstream. You do not need to over-engineer assertions on it; a status-code assertion plus a body-contains assertion on a string your page always returns is the workhorse pair. The point of multi-region here is not a fancier assertion, it is sampling the right places.&lt;/p&gt;
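
&lt;p&gt;Here is that workhorse pair as a small Python sketch, with the &lt;code&gt;x-vercel-id&lt;/code&gt; header recorded for context rather than asserted on. It works on an already-fetched response (status, body, and a dict of lowercased header names), so it runs without network access. The function names, the marker string, and the sample header value are hypothetical. The split on &lt;code&gt;::&lt;/code&gt; assumes the rough shape described above, and that format is Vercel's to change.&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
def parse_vercel_id(value):
    """Split an x-vercel-id value on '::' (assumed shape: pop::region::hash).

    Informational only: the format belongs to Vercel and may change, so the
    parts are recorded, never hard-asserted on.
    """
    parts = value.split("::")
    return {"pop": parts[0], "regions": parts[1:-1], "hash": parts[-1]}


def check_response(status, body, headers, must_contain):
    """Status-code plus body-contains assertions on one fetched response.

    `headers` is assumed to be a dict with lowercased header names.
    """
    failures = []
    if status != 200:
        failures.append(f"expected status 200, got {status}")
    if must_contain not in body:
        failures.append(f"body is missing {must_contain!r}")
    vercel_id = headers.get("x-vercel-id")
    return {
        "ok": not failures,
        "failures": failures,
        "served_by": parse_vercel_id(vercel_id) if vercel_id else None,
    }


# Hypothetical responses; "Dashboard" stands in for a string your page always returns.
healthy = check_response(
    200,
    "Dashboard | Example App",
    {"x-vercel-id": "fra1::iad1::abc12-1700000000000-0f1e2d3c"},
    "Dashboard",
)
erroring = check_response(502, "Bad Gateway", {}, "Dashboard")

print(healthy["ok"], healthy["served_by"]["pop"])  # True fra1
print(erroring["ok"], erroring["failures"])
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The pass or fail decision rests entirely on the status code and the body string. The parsed header is only there to help you read a failure afterward.&lt;/p&gt;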

&lt;p&gt;There is a second, sharper layer of coverage that an HTTP probe alone does not give you, and it is Velprove's strongest instrument here: the browser login monitor. It opens a real browser, signs into your own login from the probe region you choose, and verifies that a known string from the post-login page is present. Run it from a non-default region like Europe and it walks the actual user path, sign-in and the redirect that follows, through that region's CDN, the way a real customer in Frankfurt would, with no code to write. That is a stronger statement than an HTTP status code: it confirms a real session can be established through that edge, not just that a URL answered. If Frankfurt-served sign-ins were failing while the rest of the world stayed green, this is the monitor that goes red.&lt;/p&gt;

&lt;h3&gt;
  
  
  Picking which regions to cover
&lt;/h3&gt;

&lt;p&gt;Cover where your users come from, not every region for its own sake. If your traffic is mostly European, a Europe vantage is the one that has to exist; if you have a meaningful Asia or North America base, add those. The reason this is reachable without paying is the free plan's shape: 10 monitors and all 5 regions are available on Free, so a two- or three-vantage spread across the same URL fits comfortably inside the free tier. You do not need a paid plan to stop being single-vantage. Pick the two or three regions your customers actually live in, point a monitor from each, and the FRA1-shaped blind spot closes for the places that matter to you.&lt;/p&gt;
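
&lt;p&gt;If you have a rough traffic split by region, choosing vantages is a small greedy selection. Take regions in order of user share until you cover most of your traffic or reach the number of monitors you want to spend. Below is a minimal Python sketch, assuming hypothetical traffic shares and using Velprove's five region names as plain labels. It is not a Velprove feature.&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
def pick_regions(traffic_share, target=0.75, budget=3):
    """Greedily choose monitor regions by share of users.

    traffic_share maps region label to fraction of users (0 to 1).
    Stops once the cumulative share reaches `target` or `budget` regions
    are chosen. One monitor per region, all against the same URL.
    """
    chosen, covered = [], 0.0
    ranked = sorted(traffic_share.items(), key=lambda item: item[1], reverse=True)
    for region, share in ranked:
        if covered >= target or len(chosen) >= budget:
            break
        chosen.append(region)
        covered += share
    return chosen, covered


# Hypothetical traffic split for a mostly European product.
shares = {
    "europe": 0.55,
    "north-america": 0.25,
    "asia": 0.12,
    "united-kingdom": 0.05,
    "oceania": 0.03,
}

regions, covered = pick_regions(shares)
print(regions, round(covered, 2))  # ['europe', 'north-america'] 0.8
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Two or three monitors fit well within the free plan's 10, so the budget here is about attention, not cost. Raise &lt;code&gt;target&lt;/code&gt; if a long tail of regions matters to you.&lt;/p&gt;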

&lt;h2&gt;
  
  
  What this teardown does and does not claim
&lt;/h2&gt;

&lt;p&gt;The honest version of this post is the one that draws its own boundaries, so here they are plainly.&lt;/p&gt;

&lt;p&gt;We did not detect this. Velprove did not monitor Vercel's FRA1 region during this incident. We are not claiming we caught it, and we are not claiming we would have caught it as a matter of fact. Everything above is the failure shape that multi-region coverage is built to surface, presented as a worked example, not a detection war story.&lt;/p&gt;

&lt;p&gt;There is no detection-time lead to quote. The incident opened on Vercel's status page at its own start time; there is no published gap between impact beginning and acknowledgment for us to measure a lead against, and we do not invent one.&lt;/p&gt;

&lt;p&gt;Coverage surfaces, it does not prevent. Watching FRA1 from a Europe vantage would tell you sooner that Frankfurt-served users are seeing errors. It would not have stopped FRA1 from degrading. That is Vercel's platform to fix, and they did.&lt;/p&gt;

&lt;p&gt;We do not know the root cause. Vercel did not publish one. We do not fill that gap with a guess about AWS, capacity, deploys, or traffic.&lt;/p&gt;

&lt;p&gt;One region per monitor. The coverage above is the result of running several monitors, one per region, against the same URL. A single monitor is a single vantage, by design.&lt;/p&gt;

&lt;p&gt;And the reroute is not Enterprise Function failover. The CDN-region rerouting Vercel performed here is platform behavior for everyone; automatic cross-region Function failover is a separate, Enterprise-only feature. Do not read one as the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern, not just this incident
&lt;/h2&gt;

&lt;p&gt;A single-region degradation on a platform you do not fully watch is a recurring shape, not a one-off. The instrument outlives this particular Frankfurt blip: monitors running from the regions your users actually come from, so that a failure localized to one region lands on a monitor that was sampling there. Build that coverage once and it is ready the next time any one region of any provider goes sideways while the rest of the world sees green.&lt;/p&gt;

&lt;p&gt;This teardown reads as a pair with one we published recently. In &lt;a href="https://velprove.com/blog/supabase-storage-latency-june-2026" rel="noopener noreferrer"&gt;the Supabase Storage latency teardown&lt;/a&gt;, the failure was a region that was slow but still returning 200s: the monitor was hitting the affected path, the status code lied by staying green, and the catching instrument was a response-time threshold. Here the distinction is errors versus slow, and coverage versus depth. There the monitor saw the right place and needed a sharper assertion; here the monitor needs to see the place at all. Both are members of the same broader family, a green dashboard hiding a degraded reality, which we catalog in &lt;a href="https://velprove.com/blog/anatomy-of-a-silent-outage" rel="noopener noreferrer"&gt;the anatomy of a silent outage&lt;/a&gt; and in &lt;a href="https://velprove.com/blog/why-uptime-monitors-miss-outages" rel="noopener noreferrer"&gt;why uptime monitors miss outages&lt;/a&gt;. Errors in a region you do not watch is one of the quieter ways a monitor can be green and wrong at the same time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitor your Vercel app
&lt;/h2&gt;

&lt;p&gt;A single-region CDN failure like FRA1 on June 14 is invisible to a monitor that never routes through the failing region, and visible to a monitor that does. The fix is coverage: run monitors from the regions your users come from, so a one-region degradation lands somewhere you are actually watching. Velprove's free plan covers the setup: 10 monitors, all 5 regions on every plan, no credit card. Browser login monitors are free, so you can walk the real sign-in path from a non-default region. And if your check needs an auth step first, multi-step API monitors are free up to 3 steps. Point an HTTP monitor at your production URL from each region your customers live in, add a browser login monitor from the region you care about most, and the next single-region incident shows up on a monitor instead of in a support ticket.&lt;/p&gt;

&lt;p&gt;This post owns the coverage angle. For the deeper platform surface (Cron, Marketplace storage, regional Functions), see &lt;a href="https://velprove.com/blog/monitor-vercel-hosted-site" rel="noopener noreferrer"&gt;monitor a Vercel site at the platform layer&lt;/a&gt;. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Start free&lt;/a&gt; and point a monitor at your Vercel app from the regions that matter to you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What happened during the Vercel FRA1 incident on June 14, 2026?
&lt;/h3&gt;

&lt;p&gt;Vercel's CDN in the FRA1 region, Frankfurt, Germany, hit elevated latency and errors starting 2026-06-14 21:03 UTC. Vercel rerouted that region's CDN traffic to CDG1, Paris, reported impact mitigated at 21:26 UTC, then routed traffic back to Frankfurt, with all traffic restored to FRA1 at 2026-06-15 00:25 UTC and the incident resolved at 01:24 UTC. Vercel classified it Major and it was a partial degradation, elevated latency and errors, not a full outage. Only the FRA1 CDN was named; no other component or region was. Vercel did not publish a root cause.&lt;/p&gt;

&lt;h3&gt;
  
  
  Was all of Vercel down during the FRA1 incident?
&lt;/h3&gt;

&lt;p&gt;No. This was a partial degradation of one CDN region, Frankfurt, not a platform-wide outage. Vercel named only the FRA1 CDN; Functions, Builds, the Edge, and every other Vercel region were not named as affected. Apps whose users did not route through Frankfurt saw nothing. Do not conflate this with the separate June 2026 Vercel incidents, the June 9 build errors and the June 8 DUB1 Functions errors, which were different components and regions.&lt;/p&gt;

&lt;h3&gt;
  
  
  How long did the Vercel FRA1 incident last?
&lt;/h3&gt;

&lt;p&gt;The incident ran from 2026-06-14 21:03 UTC to 2026-06-15 01:24 UTC, approximately 4 hours 20 minutes. That figure is our derivation from the two published timestamps, not a number Vercel stated. The sharp user-facing impact was concentrated in the first ~23 minutes: Vercel reported impact mitigated at 21:26 UTC. Most of the remaining time was Vercel cautiously rerouting traffic from Paris back to Frankfurt, with the fix at 23:15 UTC and all traffic restored to FRA1 at 00:25 UTC, so the long tail was a staged return rather than continuous errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why did Vercel reroute FRA1 traffic to CDG1?
&lt;/h3&gt;

&lt;p&gt;Because CDG1, Paris, is the next-closest region to FRA1, Frankfurt, and Vercel's documented CDN behavior is to automatically reroute traffic to the next-closest region during regional trouble. When Frankfurt degraded, Vercel steered that region's CDN traffic to Paris to keep serving requests, then routed it back once Frankfurt recovered. This is platform-operated CDN rerouting that applies to everyone. It is not the same as automatic Vercel Function failover across regions, which is an Enterprise-only feature.&lt;/p&gt;

&lt;h3&gt;
  
  
  Could an uptime monitor have surfaced the FRA1 errors?
&lt;/h3&gt;

&lt;p&gt;Only if it was monitoring from a region that routed through Frankfurt. A single monitoring vantage that never touches the failing region reports 100 percent green, because nothing it sampled broke; the errors happened to a population it was not watching. Surfacing a single-region failure is a matter of coverage: run monitors from more than one region, including the regions your users come from, so a one-region degradation lands on a monitor that was sampling there. A browser login monitor run from that region walks the real sign-in path through that CDN.&lt;/p&gt;

&lt;h3&gt;
  
  
  Did Velprove detect the June 2026 Vercel FRA1 incident?
&lt;/h3&gt;

&lt;p&gt;No, and this post makes no such claim. Velprove did not monitor Vercel's FRA1 region, and there is no detection-time lead to quote, because the incident opened on Vercel's status page at its own start time, leaving no published impact-versus-acknowledgment gap to measure against. What we can say is that the failure shape, errors concentrated in one region you might never watch, is exactly what running monitors from more than one vantage is built to surface. Coverage surfaces such a failure; it does not prevent it.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>webdev</category>
      <category>devops</category>
      <category>uptime</category>
    </item>
    <item>
      <title>Supabase Storage Latency, June 2026: US-West-2 Storage Requests Slowed ~9 Hours</title>
      <dc:creator>velprove</dc:creator>
      <pubDate>Mon, 15 Jun 2026 14:00:05 +0000</pubDate>
      <link>https://dev.to/velprove/supabase-storage-latency-june-2026-us-west-2-storage-requests-slowed-9-hours-5bb6</link>
      <guid>https://dev.to/velprove/supabase-storage-latency-june-2026-us-west-2-storage-requests-slowed-9-hours-5bb6</guid>
      <description>&lt;p&gt;&lt;strong&gt;The shape:&lt;/strong&gt; on June 12 2026, Supabase Storage requests in US-West-2 ran slow, not down, for about 8 hours 40 minutes, returning 200s the whole time. A check that asserts only a 200 status code stayed green through the entire window. The instrument that surfaces a regional 200-but-slow degradation is a response-time threshold on an HTTP monitor against a Storage-backed endpoint, set near your warm p95. Velprove ships response-time thresholds and 5 regions on every plan, so running the same probe from more than one region is what tells you the slowdown is the vendor's, not yours.&lt;/p&gt;

&lt;p&gt;Here is the one load-bearing fact to hold onto. A plain HTTP monitor asserting only &lt;code&gt;status_code = 200&lt;/code&gt; on a Storage URL would have reported 100 percent uptime through the entire window, because Storage kept returning 200s. It was just slow.&lt;/p&gt;

&lt;h2&gt;
  
  
  What broke on June 12, 2026 (and what didn't)
&lt;/h2&gt;

&lt;p&gt;What was affected was narrow. Supabase named one product, Storage, and one region, US-West-2, and nothing else. The shape was a slowdown, not a stoppage: requests kept returning 200s throughout the window, and Supabase classified the incident as minor with the Storage component in degraded performance rather than an outage state. Operations were described as returning to normal once mitigation took hold, with a job-queue backlog draining afterward.&lt;/p&gt;

&lt;p&gt;What did not happen matters just as much. This was not a hard outage, and the vendor did not name any other Supabase product or any other region as affected. Supabase published no root cause; the only causal word in the record is "load," in the phrase "load mitigation strategies," and we quote that and stop. There is also a single source here, the Supabase status page, with no third-party corroboration, which is normal for a minor single-region blip and is stated so the post claims no more than the record supports.&lt;/p&gt;

&lt;h2&gt;
  
  
  The timeline (UTC, primary-source)
&lt;/h2&gt;

&lt;p&gt;The timeline below is taken verbatim from the Supabase Status incident permalink. All times UTC. Source: &lt;a href="https://status.supabase.com/incidents/zwmwcs0r4hll" rel="noopener noreferrer"&gt;Supabase Status incident zwmwcs0r4hll&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time (UTC)&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Update&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Jun 12, 16:00&lt;/td&gt;
&lt;td&gt;Identified&lt;/td&gt;
&lt;td&gt;"Users in US-West-2 are seeing increased latency on storage requests. The team has identified the issue and is working on a solution."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 12, 16:30&lt;/td&gt;
&lt;td&gt;Update&lt;/td&gt;
&lt;td&gt;"We are implementing load mitigation strategies to improve the performance of storage requests in US-West-2."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 12, 17:29&lt;/td&gt;
&lt;td&gt;Update&lt;/td&gt;
&lt;td&gt;"The team is still working on a solution, and mitigation work is underway."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 12, 18:28&lt;/td&gt;
&lt;td&gt;Update&lt;/td&gt;
&lt;td&gt;"We have implemented further mitigations and are tracking their progress. We continue to actively work this issue."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 12, 19:30&lt;/td&gt;
&lt;td&gt;Update&lt;/td&gt;
&lt;td&gt;"We are seeing improvement in the performance of us-west-2 after implementation of our mitigation strategies."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 12, 20:26&lt;/td&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;"Backlog is being effectively processed and operations have returned to normal. We continue to monitor the performance in us-west-2."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 12, 23:57&lt;/td&gt;
&lt;td&gt;Update&lt;/td&gt;
&lt;td&gt;"Job queues continue to drain. We will actively monitor until they are complete."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 13, 00:40&lt;/td&gt;
&lt;td&gt;Resolved&lt;/td&gt;
&lt;td&gt;"All residual jobs have drained. Storage performance and job activity is normal."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The published machine timestamps run from 2026-06-12T16:00:44Z to 2026-06-13T00:40:18Z, which works out to approximately 8 hours 40 minutes. That hours figure is our derivation from the two published timestamps, not a number Supabase stated, and the window crossed midnight into June 13 UTC. Note the asynchronous tail: the request path returned to normal around 20:26 UTC, but the incident did not resolve until 00:40 UTC, because a job-queue backlog still had to drain after the latency recovered. Recovery of the request path was not the end of the incident.&lt;/p&gt;

&lt;p&gt;Supabase did not publish a root cause. The only causal hint is the vendor's own phrase "load mitigation strategies" in the 16:30 update. We quote that and stop. We do not extrapolate "load" into a traffic spike or a subsystem failure, and as of writing there is no postmortem at the incident URL; minor incidents typically do not get one. This is also a single source. We did not independently observe the latency, and there is no third-party corroboration, which is normal for a minor single-region blip and is stated here so the post claims no more confirmation than exists.&lt;/p&gt;

&lt;p&gt;One disambiguation. This is not the February 12 2026 us-east-2 incident, which was a hard regional outage with its own signed postmortem and a much wider blast radius. Different incident, different region, different severity. That one is covered in &lt;a href="https://velprove.com/blog/monitor-supabase-backed-app" rel="noopener noreferrer"&gt;the full Supabase monitoring guide&lt;/a&gt;; keep the two distinct.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a status-code check stayed green
&lt;/h2&gt;

&lt;p&gt;The whole reason this incident is worth a teardown is the gap between what the status code said and what users felt. Storage endpoints returned success codes throughout. The degradation lived in response time, not in the status code. A monitor that only looks at the code on the front of the response saw a healthy 200 every single interval, while the actual user experience, a signed-URL fetch or a Storage REST read taking far longer than usual, was degraded the whole time.&lt;/p&gt;

&lt;p&gt;This is a textbook member of the silent-outage family: green status, degraded reality. We have written up the broader catalog in &lt;a href="https://velprove.com/blog/anatomy-of-a-silent-outage" rel="noopener noreferrer"&gt;the silent-outage taxonomy&lt;/a&gt;. A 200-but-slow latency shift is the same family of problem as a 200-but-wrong-body or a 200-but-empty response: the status code is true and useless at the same time.&lt;/p&gt;

&lt;p&gt;The only external instrument that turns this from invisible into a red alert is a response-time threshold, set near your real warm p95. That is the bridge to the rest of this post: what to assert on, how to calibrate it, and how to use more than one region to tell a vendor slowdown apart from your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to monitor Supabase Storage for latency
&lt;/h2&gt;

&lt;p&gt;The primitive is a Velprove HTTP monitor pointed at a Storage-backed endpoint that exercises your real Storage path: a public object URL, a signed-URL fetch, or a Storage REST read. You are not probing some synthetic health route. You are hitting the same Storage surface your users hit, so the monitor feels the same latency they would.&lt;/p&gt;

&lt;p&gt;The load-bearing detail is the Success Conditions you set. Three work together here, and the order of importance is deliberate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response time under a threshold.&lt;/strong&gt; This is the assertion that catches a 200-but-slow regional degradation. Set the threshold at roughly 1.5x to 2x your warm p95, measured over a real traffic window. This is the same calibration discipline the &lt;a href="https://velprove.com/blog/monitor-supabase-backed-app" rel="noopener noreferrer"&gt;Supabase monitoring guide&lt;/a&gt; uses for the Edge Function response-time tail. A threshold set too loose rides straight through a slow-but-under-threshold degradation; set too tight, it false-pages on normal variance. A default uptime check does not catch latency for free; the threshold is the catch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Status code equals 200.&lt;/strong&gt; This catches the rarer case where Storage fails hard rather than just slowing down. It is the assertion most people start and stop with, and on its own it is exactly what stayed green through this incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Body contains a known string.&lt;/strong&gt; Assert on a string the object or endpoint always returns, and the probe becomes a content-correctness check too, not just a timing and status check. Optional, but recommended, and the same pattern the guide uses for its Edge Function &lt;code&gt;body_contains&lt;/code&gt; assertion.&lt;/p&gt;
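
&lt;p&gt;To see why the order matters, here is a minimal Python sketch that evaluates the three conditions on a single probe result and contrasts them with a status-only check. The &lt;code&gt;Probe&lt;/code&gt; type, the marker string, and every latency number are hypothetical. They were chosen only to reproduce the 200-but-slow shape of this incident.&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One probe result against a Storage-backed endpoint (hypothetical)."""
    status: int
    latency_ms: float
    body: str


def status_only(probe):
    """The check that stayed green: asserts only status_code == 200."""
    return probe.status == 200


def success_conditions(probe, threshold_ms, must_contain):
    """The three conditions above, in order of importance."""
    return {
        "response_time_under_threshold": threshold_ms > probe.latency_ms,
        "status_is_200": probe.status == 200,
        "body_contains": must_contain in probe.body,
    }


warm = Probe(status=200, latency_ms=170, body="storage-probe-ok")
slow = Probe(status=200, latency_ms=1400, body="storage-probe-ok")  # 200-but-slow

for name, probe in (("warm", warm), ("slow", slow)):
    checks = success_conditions(probe, threshold_ms=330, must_contain="storage-probe-ok")
    print(name, "status-only:", status_only(probe), "all three:", all(checks.values()))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Both probes pass the status-only check. Only the response-time condition tells them apart, and that is the gap this incident fell through.&lt;/p&gt;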

&lt;h3&gt;
  
  
  Calibrating the response-time threshold to warm p95
&lt;/h3&gt;

&lt;p&gt;The threshold is the whole monitor, so do not pull a round number from a docs page. Measure your warm p95 over a representative traffic window (the 95th-percentile response time once caches are warm), then set the alert at about 1.5x to 2x that. The multiplier leaves room for normal variance while still tripping on a sustained shift like this one. If your warm p95 on a signed-URL fetch is 180ms, a threshold between 270ms and 360ms flags a real degradation without paging on jitter. &lt;a href="https://velprove.com/blog/monitor-supabase-backed-app" rel="noopener noreferrer"&gt;The full Supabase monitoring guide&lt;/a&gt; applies the same discipline to Edge Functions.&lt;/p&gt;
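
&lt;p&gt;Here is a minimal Python sketch of the calibration arithmetic. It assumes you have exported a window of response-time samples, in milliseconds, taken after caches are warm. It uses the nearest-rank percentile. Interpolating definitions can give slightly different values on small samples. The function names are illustrative.&lt;/p&gt;

```python
def warm_p95(samples_ms: list[float]) -> float:
    """95th-percentile response time by the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one warm sample")
    ordered = sorted(samples_ms)
    # Nearest rank is ceil(0.95 * n); integer math avoids float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def threshold_band(p95_ms: float) -> tuple[float, float]:
    """The 1.5x-to-2x alert band around a warm p95."""
    return 1.5 * p95_ms, 2.0 * p95_ms


print(threshold_band(180.0))  # (270.0, 360.0)
```

&lt;p&gt;Feed &lt;code&gt;warm_p95&lt;/code&gt; only warm samples. Cold-start outliers would inflate the p95 and loosen the threshold.&lt;/p&gt;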

&lt;p&gt;One trap to avoid: do not put a Storage write or mutation on the monitor path. Point the probe at a read-only object or path, so a year of probes every few minutes does not quietly accumulate state in your bucket. A public object URL or a read-only signed-URL fetch is the right target.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using multiple regions to localize a regional slowdown
&lt;/h2&gt;

&lt;p&gt;Velprove offers 5 regions on every plan, but be precise about what that buys you: each check runs from one region you pick. A single monitor does not magically watch all five at once. So for a single-region incident like this one, the value of multiple regions is not coverage, it is differential diagnosis.&lt;/p&gt;

&lt;p&gt;Configure the same Storage probe from more than one region, which means more than one monitor. When response time climbs on the probe routed toward US-West-2 while probes from other vantages stay flat, you have localized the slowdown to the vendor's region rather than to your own app or network. That is a more useful signal than a single red monitor: it tells you where the problem is, not just that there is one. The mechanism needs more than one monitor, on a plan that allows it.&lt;/p&gt;
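
&lt;p&gt;A sketch of the differential comparison in Python. It assumes you can export recent response times per monitor, with one monitor per vantage. The vantage labels, the numbers, and the 1.5x factor are placeholders, not Velprove region names or defaults.&lt;/p&gt;

```python
from statistics import median


def shifted_vantages(baseline_ms: dict[str, float],
                     recent_ms: dict[str, list[float]],
                     factor: float = 1.5) -> list[str]:
    """Vantages whose recent median exceeds `factor` times their own baseline.

    Each vantage is compared with its own baseline because a distant
    vantage is slower even on a good day, so one shared threshold would
    blur a relative shift.
    """
    shifted = []
    for vantage, samples in sorted(recent_ms.items()):
        if samples and median(samples) > factor * baseline_ms[vantage]:
            shifted.append(vantage)
    return shifted


# Hypothetical numbers: one vantage climbs, the others stay flat.
baseline = {"vantage-a": 120.0, "vantage-b": 260.0, "vantage-c": 310.0}
recent = {
    "vantage-a": [480.0, 510.0, 495.0],
    "vantage-b": [255.0, 270.0, 262.0],
    "vantage-c": [300.0, 320.0, 315.0],
}
print(shifted_vantages(baseline, recent))  # ['vantage-a']
```

&lt;p&gt;The useful result is one shifted vantage against flat others. A lone red monitor cannot give you that contrast.&lt;/p&gt;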

&lt;p&gt;This fits inside the Free plan: 5 regions on every plan including Free and 10 monitors on Free, so a two- or three-vantage differential setup is well within reach. If your Storage read needs an auth step before the object fetch, multi-step API monitors are free up to 3 steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honesty boundary
&lt;/h2&gt;

&lt;p&gt;The strongest version of this post is the version that names what it does not claim.&lt;/p&gt;

&lt;p&gt;First, and most important: Velprove did not detect this incident. We did not monitor Supabase Storage in US-West-2, and we do not claim that we caught it or that we would have. Everything above is the failure shape that response-time monitoring is built to surface, presented as a worked example. If you read a detection into this post, that is more than the facts support.&lt;/p&gt;

&lt;p&gt;Second, there is no detection-time lead to claim here. Supabase opened this incident already in identified status at 16:00 UTC, the same minute the incident's own start timestamp records, so there is no published gap between impact start and acknowledgment to measure a lead against. We do not invent one.&lt;/p&gt;

&lt;p&gt;Third, detection is not prevention. A response-time monitor surfaces a slowdown faster and tells you it is upstream. It does not stop the slowdown.&lt;/p&gt;

&lt;p&gt;Fourth, the monitor catches this only if the response-time threshold is calibrated near real warm p95. A loose threshold rides through a slow-but-under-threshold degradation; a default check with no response-time assertion catches nothing here at all.&lt;/p&gt;

&lt;p&gt;Fifth, one region per check. The differential-diagnosis value above requires configuring the same probe from more than one region; a single monitor watches a single vantage.&lt;/p&gt;

&lt;p&gt;Sixth, we do not know what broke inside Supabase. The only hint at cause the vendor published is the word "load," in the phrase "load mitigation strategies." We quote it and stop there. We offer no speculation about cause, no quantitative latency figures, and no affected-customer counts, because none were published.&lt;/p&gt;

&lt;h2&gt;
  
  
  This pattern, not just this incident
&lt;/h2&gt;

&lt;p&gt;A regional 200-but-slow latency shift on a managed surface, Storage here, is a recurring shape, not a one-off. The instrument outlives this particular blip: a response-time threshold calibrated to warm p95, plus a multi-region differential to localize it. Build that once and it is ready the next time a managed dependency slows in one region while its status code keeps saying 200.&lt;/p&gt;

&lt;p&gt;It reads as a pair with a teardown we published recently. In &lt;a href="https://velprove.com/blog/github-actions-may-2026-detection-teardown" rel="noopener noreferrer"&gt;the GitHub Actions May 2026 detection teardown&lt;/a&gt;, the same underlying lesson shows up on a different surface with a different failure shape. There, the endpoint returned 200 with a body that told the truth the status code hid, so the catching primitive was a body assertion. Here the body is fine and the truth is in the timing, so the catching primitive is a response-time threshold. Both belong to the same family in &lt;a href="https://velprove.com/blog/anatomy-of-a-silent-outage" rel="noopener noreferrer"&gt;the silent-outage catalog&lt;/a&gt;: green status, degraded reality, and a different assertion to catch each.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitor your Supabase Storage
&lt;/h2&gt;

&lt;p&gt;A 200-but-slow regional degradation like this June 12 incident is invisible to a status-code-only check and visible to a response-time threshold set near your warm p95. Velprove's free plan covers the setup: 10 monitors, 5 regions on every plan, response-time thresholds on every plan, and no credit card; if your Storage read needs an auth step first, multi-step API monitors are free up to 3 steps. Point an HTTP monitor at a read-only Storage object, add the response-time threshold, and the next regional slowdown surfaces on a monitor instead of in a support ticket.&lt;/p&gt;

&lt;p&gt;This post owns the Storage angle. For the full Supabase monitoring set (Auth, RLS, Edge Functions, and Realtime), see &lt;a href="https://velprove.com/blog/monitor-supabase-backed-app" rel="noopener noreferrer"&gt;the complete Supabase monitoring guide&lt;/a&gt;. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Start free&lt;/a&gt; and point a response-time monitor at your own Storage surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What happened during the Supabase Storage latency incident in June 2026?
&lt;/h3&gt;

&lt;p&gt;Storage requests in US-West-2 ran slowly between 2026-06-12 16:00 UTC and 2026-06-13 00:40 UTC, approximately 8 hours 40 minutes, with the window crossing midnight UTC. Supabase classified it minor and the Storage component sat in degraded performance, not outage. Requests were served, just slowly, and a job-queue backlog drained after the request path recovered. Only the Storage surface and only the US-West-2 region were named. Supabase did not publish a root cause; the only stated mitigation was load mitigation strategies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Was Supabase down during the June 2026 Storage incident?
&lt;/h3&gt;

&lt;p&gt;No. This was a latency degradation, not an outage. Storage requests continued to return successfully throughout; they were slower than normal in US-West-2. Other Supabase products and other regions were not named as affected. Do not conflate it with the separate February 12 2026 us-east-2 incident, which was a hard regional outage with its own postmortem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can an uptime monitor catch a Supabase Storage latency degradation?
&lt;/h3&gt;

&lt;p&gt;Only if it asserts on response time, not just the status code. A check that asserts &lt;code&gt;status_code = 200&lt;/code&gt; alone stays green through a 200-but-slow degradation. Add a &lt;code&gt;response_time_ms&lt;/code&gt; assertion set at roughly 1.5x to 2x your warm p95 on an HTTP monitor against a Storage-backed endpoint, and a sustained shift above that threshold trips an alert. Calibration is the catch: too loose and it rides through, too tight and it false-pages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Did Velprove detect the June 2026 Supabase Storage slowdown?
&lt;/h3&gt;

&lt;p&gt;No, and this post makes no such claim. Velprove did not observe this incident. There is also no detection-time-lead figure to quote: Supabase opened the incident already identified, at the same minute it began, so there is no published gap between impact and acknowledgment. What we can say is that the failure shape, a regional 200-but-slow latency degradation, is exactly what a response-time-threshold monitor is built to surface, and exactly what a status-code-only check is not.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I localize a Storage slowdown to Supabase's region instead of my own app?
&lt;/h3&gt;

&lt;p&gt;Run the same Storage probe from more than one Velprove region. Each Velprove check runs from one region you pick out of five, so a single monitor cannot watch all five at once. Configuring the same Storage HTTP monitor from two or three vantages, however, turns it into a differential test. When response time climbs on the probe routed through the affected region while the others stay flat, the slowdown is localized to the vendor's region, not to your network or app.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>webdev</category>
      <category>devops</category>
      <category>uptime</category>
    </item>
    <item>
      <title>GitHub API Authentication Failures, June 2026: Monitoring Authenticated Endpoints</title>
      <dc:creator>velprove</dc:creator>
      <pubDate>Mon, 15 Jun 2026 14:00:02 +0000</pubDate>
      <link>https://dev.to/velprove/github-api-authentication-failures-june-2026-monitoring-authenticated-endpoints-fm8</link>
      <guid>https://dev.to/velprove/github-api-authentication-failures-june-2026-monitoring-authenticated-endpoints-fm8</guid>
      <description>&lt;p&gt;&lt;strong&gt;The trap:&lt;/strong&gt; on June 10 2026, valid GitHub credentials started drawing erroneous 401s on what GitHub reported as approximately 15% of API traffic, an incident GitHub ran from 15:20 to 16:39 UTC. The structural problem is that the people most likely to be watching cannot see it the easy way: an anonymous ping has no auth step to fail, and a status-only probe treats a 401 as a clean, valid HTTP status and stays green. The only instrument that catches an auth-plane failure is one that authenticates, calls a protected endpoint, and asserts the &lt;em&gt;authenticated&lt;/em&gt; response came back, a 200 and not a 401. Velprove's free multi-step API monitor does exactly that: authenticate, call a protected endpoint, assert 200 and an auth-only field, so green means the authenticated response actually returned. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Start free&lt;/a&gt;, no credit card required.&lt;/p&gt;

&lt;h2&gt;
  
  
  When a 401 is the failure, not the gate
&lt;/h2&gt;

&lt;p&gt;Most of the time a 401 is the system working. You called a protected endpoint without valid credentials, the server refused you, and it said so with the correct status code. That is the gate doing its job. The June 10 incident is the other kind of 401, the one where the gate is broken: you sent valid credentials and the server rejected you anyway. GitHub's own words were "erroneous 401 responses." The request was turned away as unauthenticated even though the token was correct. A response that should have been a 200 flipped to a 401.&lt;/p&gt;

&lt;p&gt;The reason this is so easy to miss is that the failure surfaces as a clean, valid HTTP status, not a 500 and not a timeout. Nothing crashed. Nothing hung. The server answered promptly and politely with the wrong answer. So the two cheapest ways to watch an API both sail straight through it. An anonymous ping that hits a public landing page or an unauthenticated endpoint never exercises the auth path at all, because there is no auth step in it to fail. And a status-only probe that calls a real endpoint but treats anything that is not a 5xx as "up" reads the 401 as a perfectly valid response and stays green for the whole window. This is a textbook member of the silent-outage family we catalog in &lt;a href="https://velprove.com/blog/anatomy-of-a-silent-outage" rel="noopener noreferrer"&gt;the anatomy of a silent outage&lt;/a&gt; and explain at length in &lt;a href="https://velprove.com/blog/why-uptime-monitors-miss-outages" rel="noopener noreferrer"&gt;why uptime monitors miss outages&lt;/a&gt;: the internal signal looks healthy while a specific population, here every integration holding a valid token, is being locked out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why "200 OK" is the wrong question for an authenticated endpoint
&lt;/h3&gt;

&lt;p&gt;For a public, no-auth endpoint, "did it return 200?" is a reasonable first question. For an authenticated endpoint it is the wrong question, because the endpoint can return a non-200 that is itself a correct, valid status, a 401 or a 403, and a status-only check cannot tell "refused you correctly" apart from "refused you erroneously." The question that actually matters is narrower: did the &lt;em&gt;authenticated&lt;/em&gt; response come back? Not "did something respond," but "did the response that only an authorized caller is supposed to get come back?" That is the question an erroneous-401 incident answers with a no, and it is the question your monitor has to be built to ask.&lt;/p&gt;

&lt;h2&gt;
  
  
  The timeline (UTC, primary-source)
&lt;/h2&gt;

&lt;p&gt;The timeline below is taken verbatim from the GitHub Status incident permalink. All times UTC. Source: &lt;a href="https://www.githubstatus.com/incidents/fcj3088jg1wx" rel="noopener noreferrer"&gt;GitHub Status incident fcj3088jg1wx&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time (UTC)&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;15:20&lt;/td&gt;
&lt;td&gt;Investigating. "We are investigating reports of impacted performance for some GitHub services."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:23&lt;/td&gt;
&lt;td&gt;Investigating. "API Requests is experiencing degraded availability. We are continuing to investigate."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:27&lt;/td&gt;
&lt;td&gt;Investigating. "Issues is experiencing degraded performance. We are continuing to investigate."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:27&lt;/td&gt;
&lt;td&gt;Investigating. "We are investigating issues related to sporadic authentication failures impacting approximately 15% of API traffic. We will continue to investigate and provide updates."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:46&lt;/td&gt;
&lt;td&gt;Investigating. "We continue to investigate issues related to sporadic authentication failures, impacting approximately 15% of API traffic. Further updates will be provided as we work to mitigate."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15:46&lt;/td&gt;
&lt;td&gt;Investigating. "The degradation affecting Issues has been mitigated. We are monitoring to ensure stability."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:21&lt;/td&gt;
&lt;td&gt;Investigating. "We continue to investigate issues related to sporadic authentication failures, impacting approximately 15% of API traffic. Erroneous 401 responses are causing app integrations to trigger authentication flows. We have identified a problematic component in our infrastructure and are working to mitigate."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:36&lt;/td&gt;
&lt;td&gt;Investigating. "The degradation affecting API Requests has been mitigated. We are monitoring to ensure stability."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:37&lt;/td&gt;
&lt;td&gt;Monitoring. "The degradation has been mitigated. We are monitoring to ensure stability."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16:39&lt;/td&gt;
&lt;td&gt;Resolved. "This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GitHub classified the incident as critical. From start to resolution, the window ran 15:20 to 16:39 UTC, about 1 hour 19 minutes. Two components were named: API Requests and Issues. Issues was mitigated first, at 15:46, and API Requests was mitigated about 50 minutes later, at 16:36. The resolution update promised a root cause analysis when available; at the time of writing, none had been published. The only thing GitHub said about cause was that it had "identified a problematic component in our infrastructure," so this post quotes that and does not speculate beyond it.&lt;/p&gt;
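
&lt;p&gt;The window arithmetic can be checked against the timestamps in the table above with Python's standard library. Both times fall on the same UTC day, so the date can be left out. The helper name is illustrative.&lt;/p&gt;

```python
from datetime import datetime


def minutes_between(start_hhmm: str, end_hhmm: str) -> int:
    """Whole minutes from start to end, for two HH:MM times on the same day."""
    t0 = datetime.strptime(start_hhmm, "%H:%M")
    t1 = datetime.strptime(end_hhmm, "%H:%M")
    return int((t1 - t0).total_seconds() // 60)


window = minutes_between("15:20", "16:39")   # start to resolution
hours, minutes = divmod(window, 60)
print(f"{window} min = {hours}h {minutes}m")  # 79 min = 1h 19m
print(minutes_between("15:46", "16:36"))      # 50, Issues to API Requests mitigation
```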

&lt;h2&gt;
  
  
  What broke (and what GitHub did not say broke)
&lt;/h2&gt;

&lt;p&gt;Here is the scope, in GitHub's own framing and no further.&lt;/p&gt;

&lt;p&gt;What broke: authenticated API requests. GitHub described it as "sporadic authentication failures impacting approximately 15% of API traffic," and at 16:21 named the symptom precisely: "Erroneous 401 responses are causing app integrations to trigger authentication flows." That last line is the whole shape of the incident in one sentence. Valid credentials drew 401s, and the integrations holding those credentials reacted the way they are designed to react to a 401, by trying to re-authenticate, which did not help because the credentials were never the problem. Issues was also listed as degraded and was the first component mitigated, at 15:46.&lt;/p&gt;

&lt;p&gt;A few things this post deliberately does not assert, because GitHub did not publish them. There is no claim about git operations, clone, push, or pull. There is no claim about OAuth login flows for end users, about GitHub Actions, about webhooks, or about Packages, because none of those were named as affected components, only API Requests and Issues were. The 16:21 mention of "app integrations" is a symptom of the 401s, not a separately scoped component. There was no regional scoping published; GitHub framed the impact as a share of API traffic, not a geography. And there is no root cause, because none was disclosed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The one number, stated carefully
&lt;/h3&gt;

&lt;p&gt;GitHub gave exactly one quantified figure: "approximately 15% of API traffic." That is GitHub's own characterization, repeated across three updates, and it should always be attributed to GitHub, not presented as anyone else's measurement. The word doing the most work next to it is "sporadic." A sporadic failure on a share of traffic is not a steady, every-request break. It means a monitor sampling at an interval could plausibly have seen a red on some polls and a green on others during the window. That detail matters for the honesty boundary later, and it is the reason this post does not build a tidy detection-time story off the 15%: a sporadic, partial failure does not give you a clean, deterministic moment to point at.&lt;/p&gt;
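
&lt;p&gt;A rough Python sketch shows why "sporadic" matters for an interval monitor. Purely for illustration, it assumes that each poll independently fails with probability 0.15. GitHub did not say that failures were independent per request or spread evenly across callers, so a single token could have been hit far more, or far less, often than that. The function name is hypothetical.&lt;/p&gt;

```python
def p_alert(p_fail: float, polls: int, consecutive: int = 1) -> float:
    """Chance of at least `consecutive` failed polls in a row within `polls`.

    Illustrative model only: assumes every poll fails independently
    with probability p_fail for the whole window.
    """
    if 1 > consecutive:
        raise ValueError("consecutive must be at least 1")
    # state[j]: probability that no alert has fired yet and the
    # current trailing run of failed polls has length j.
    state = [1.0] + [0.0] * (consecutive - 1)
    fired = 0.0
    for _ in range(polls):
        nxt = [0.0] * consecutive
        for run, prob in enumerate(state):
            nxt[0] += prob * (1 - p_fail)        # a passing poll resets the run
            if run + 1 == consecutive:
                fired += prob * p_fail           # the run reaches the alert rule
            else:
                nxt[run + 1] += prob * p_fail    # the run grows by one failure
        state = nxt
    return fired
```

&lt;p&gt;Take a hypothetical 5-minute interval. A probe then gets about 15 polls in the 79-minute window, and under the assumption above &lt;code&gt;p_alert(0.15, 15)&lt;/code&gt; comes out near 0.91. That makes at least one red likely, but not certain. If you require two consecutive failures before alerting, as a setup tuned to suppress flapping might, the probability drops far lower.&lt;/p&gt;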

&lt;h2&gt;
  
  
  How Velprove monitors GitHub's authenticated API
&lt;/h2&gt;

&lt;p&gt;The natural instrument for this failure shape is a free multi-step API monitor, and the reason it fits is that an auth-plane failure has more than one moving part to check. A multi-step monitor lets you chain requests and carry a result from one step into the next, and the free plan covers up to three steps, which is exactly enough for the shape that catches an erroneous 401: authenticate, call a protected endpoint, assert the authenticated response.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: authenticate
&lt;/h3&gt;

&lt;p&gt;The first step authenticates, sending a valid token so the monitor acts as a real authenticated caller rather than an anonymous one. Use a dedicated, low-privilege monitoring token. Never use a real account's credentials, and never use a token with more scope than a read of one protected endpoint needs. This step establishes the authenticated identity the rest of the check depends on and confirms that the auth step itself succeeds.&lt;/p&gt;
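
&lt;p&gt;With a GitHub personal access token there is no separate login exchange: authenticating means attaching the token to each request. Below is a minimal Python sketch of that step. It assumes the token lives in an environment variable. The variable name &lt;code&gt;GITHUB_MONITOR_TOKEN&lt;/code&gt; and the helper are hypothetical. The &lt;code&gt;Accept&lt;/code&gt; and &lt;code&gt;X-GitHub-Api-Version&lt;/code&gt; values are the ones GitHub's REST API documentation uses in its examples.&lt;/p&gt;

```python
import os


def monitoring_auth_headers(env_var: str = "GITHUB_MONITOR_TOKEN") -> dict[str, str]:
    """Build request headers for a dedicated, low-privilege monitoring token.

    Reading the token from the environment keeps it out of code and config.
    Failing loudly here means a missing token is reported as a monitor
    setup error, not mistaken for an upstream 401.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"{env_var} is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
```

&lt;p&gt;If your integration authenticates as a GitHub App instead, this is the step where it would exchange the app's credentials for an installation token. That exchange is outside this sketch.&lt;/p&gt;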

&lt;h3&gt;
  
  
  Step 2: call a protected endpoint with the token
&lt;/h3&gt;

&lt;p&gt;The second step calls a protected endpoint with that token, the kind of authenticated &lt;code&gt;api.github.com&lt;/code&gt; endpoint that returns account-scoped or token-scoped data, the data an anonymous caller could never get. This is the request that an erroneous 401 incident actually breaks, so it is the request your monitor has to make, with the token attached, exactly the way your integration makes it.&lt;/p&gt;
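
&lt;p&gt;The sketch continues with the standard library. &lt;code&gt;GET https://api.github.com/user&lt;/code&gt; is one protected endpoint that fits: it answers with the account the token belongs to and turns unauthenticated callers away with a 401. Swap in whichever endpoint your integration actually depends on. The helper names are illustrative.&lt;/p&gt;

```python
import json
import urllib.error
import urllib.request

# One protected GitHub endpoint: returns the token's own account.
GITHUB_USER_URL = "https://api.github.com/user"


def protected_request(headers: dict[str, str]) -> urllib.request.Request:
    """Build the authenticated GET exactly as the integration would send it."""
    return urllib.request.Request(GITHUB_USER_URL, headers=headers, method="GET")


def call_protected(headers: dict[str, str], timeout: float = 10.0) -> tuple[int, dict]:
    """Send the request and return (status, parsed JSON body)."""
    try:
        with urllib.request.urlopen(protected_request(headers), timeout=timeout) as resp:
            return resp.status, json.loads(resp.read())
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx. Return the status instead of crashing:
        # a 401 here is exactly the signal the next step asserts on.
        return err.code, {}
```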

&lt;h3&gt;
  
  
  Step 3: assert the authenticated response
&lt;/h3&gt;

&lt;p&gt;The third step is where green earns its meaning. You assert two things together. First, the status is 200 and not 401, which catches the bare erroneous-401 case directly. Second, a content match on a field that only appears for an authenticated, authorized response, an account-scoped field the anonymous version of the endpoint never returns. The two together are stronger than either alone: the status assertion catches the clean 401, and the content assertion catches the subtler case where the endpoint returns a 200 with a stripped-down, unauthenticated-looking body. When valid credentials start drawing erroneous 401s, this is the assertion that flips the monitor red, while a bare GET with no token, or an "any non-5xx is up" rule, would stay green.&lt;/p&gt;
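
&lt;p&gt;Here is step 3 as a Python sketch. It assumes the &lt;code&gt;/user&lt;/code&gt; endpoint from step 2, with &lt;code&gt;login&lt;/code&gt; standing in for the auth-only field. Choose a field that your own endpoint returns only to an authorized caller. For contrast, the sketch also includes the status-only rule, which reports a 401 as up. Both helper names are hypothetical.&lt;/p&gt;

```python
def status_only_up(status: int) -> bool:
    """The 'any non-5xx is up' rule: a 401 passes as healthy."""
    return status // 100 != 5


def assert_authenticated(status: int, body: dict, auth_only_field: str = "login") -> None:
    """Raise AssertionError unless the authenticated response came back."""
    # Assertion 1: status is 200, and in particular not 401.
    if status == 401:
        raise AssertionError("401 on a valid token: possible erroneous 401")
    if status != 200:
        raise AssertionError(f"expected 200, got {status}")
    # Assertion 2: a field only an authorized response carries is present.
    if not body.get(auth_only_field):
        raise AssertionError(f"200 without {auth_only_field!r}: not the authorized response")
```

&lt;p&gt;A valid token drawing a 401 passes &lt;code&gt;status_only_up(401)&lt;/code&gt; but fails &lt;code&gt;assert_authenticated&lt;/code&gt;. That is the whole difference between the two monitors.&lt;/p&gt;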

&lt;p&gt;The mechanics of building the chain, carrying the token from the auth step into the protected call, are covered step by step in &lt;a href="https://velprove.com/blog/multi-step-api-monitoring-guide" rel="noopener noreferrer"&gt;the multi-step API monitoring guide&lt;/a&gt;, so this post points there rather than re-teaching them. For why an auth-plane failure is invisible to a service's own &lt;code&gt;/healthz&lt;/code&gt; self-report, and for the bearer-token protected-route pattern in general, see &lt;a href="https://velprove.com/blog/api-health-check-patterns" rel="noopener noreferrer"&gt;API health check patterns&lt;/a&gt;. If you also want a reachability check on the public health endpoint alongside the authenticated probe, see &lt;a href="https://velprove.com/blog/monitor-rest-api-health-endpoint" rel="noopener noreferrer"&gt;monitoring a REST API health endpoint&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honesty boundary
&lt;/h2&gt;

&lt;p&gt;The strongest version of this post is the version that names what it does not claim.&lt;/p&gt;

&lt;p&gt;First, and most important: Velprove did not detect this incident. We did not monitor GitHub. Everything above is the failure shape that a multi-step authenticated API monitor is built to catch, presented as a worked example, not as a claim that we caught it.&lt;/p&gt;

&lt;p&gt;Second, a sporadic failure on a share of traffic is not guaranteed to trip on any given poll. GitHub described this as "sporadic" and as affecting "approximately 15% of API traffic," which means a monitor sampling at an interval could have caught a red on some polls and a green on others during the window. An external monitor demonstrates the shape is catchable; it does not guarantee it would have flagged this specific partial, sporadic incident on any single poll. We say that plainly rather than implying clean detection.&lt;/p&gt;

&lt;p&gt;Third, an authenticated monitor sees only the endpoint and the identity you point it at. It does not certify the whole API, all token types, or all of your integrations. It tells you whether the one authenticated call you chose still returns the authenticated response, which is enough to surface this class of failure but is not a blanket health claim about everything behind the login.&lt;/p&gt;

&lt;p&gt;Fourth, the content assertion is only as good as the field you pick. Assert on a stable field that an authorized response genuinely always contains, not on an optional or A/B-tested field that can legitimately be absent. A brittle field produces false reds; a well-chosen one is what separates "the authenticated response came back" from "something returned 200."&lt;/p&gt;

&lt;p&gt;Fifth, on cause: GitHub said only that it had "identified a problematic component in our infrastructure" and that a detailed analysis would follow. None had been published at the time of writing. Anything more specific than GitHub's own words would be guessing, so we quote them and stop there.&lt;/p&gt;

&lt;h2&gt;
  
  
  This pattern, not just this incident
&lt;/h2&gt;

&lt;p&gt;This incident is the authenticated-plane mirror of one we tore down a few days earlier. In &lt;a href="https://velprove.com/blog/github-signed-out-outage-june-2026" rel="noopener noreferrer"&gt;the signed-out mirror of this incident&lt;/a&gt;, the June 8 break lived on the anonymous public path: signed-out visitors lost Pull Requests and Issues while authenticated sessions stayed green, and the right instrument was an unauthenticated content monitor on a public URL. This June 10 incident is the exact reverse. The authenticated path itself failed, with valid credentials drawing erroneous 401s, while an anonymous probe would have seen nothing, because an anonymous probe never touches the auth plane. Two sides of the login, the same silent-outage family: one breaks for the signed-out world while the signed-in world is fine; the other breaks for the holders of valid credentials while an anonymous caller is fine.&lt;/p&gt;

&lt;p&gt;There is a third GitHub teardown in the same family, and it is worth naming precisely so it does not read as overlap. In &lt;a href="https://velprove.com/blog/github-actions-may-2026-detection-teardown" rel="noopener noreferrer"&gt;the GitHub Actions May 2026 teardown&lt;/a&gt;, a different authenticated-API surface failed: the orchestration field itself was stuck behind an otherwise healthy 200, with a workflow run that kept reporting the wrong state. That is a different failure layer from this one. There, the response came back, the status was fine, and the break was in a value inside the body. Here the break is one layer earlier, in the auth plane: the response that came back was a 401, so the request never got far enough to return the authenticated body you would inspect. Same vendor, same authenticated API, two distinct layers: the auth plane rejecting a valid credential versus an orchestration field stuck behind a 200.&lt;/p&gt;

&lt;p&gt;The generalized lesson is for any product with a token-authenticated API, not just GitHub. The vendor API your product consumes, or the API you expose to your own customers, can have its auth plane fail, valid tokens drawing erroneous 401s, while a ping on the public surface stays perfectly green. If that is the shape of your stack, &lt;a href="https://velprove.com/for/api" rel="noopener noreferrer"&gt;Velprove for API monitoring&lt;/a&gt; and &lt;a href="https://velprove.com/for/saas" rel="noopener noreferrer"&gt;Velprove for SaaS&lt;/a&gt; are built around exactly this gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What were the GitHub API authentication issues on June 10 2026?
&lt;/h3&gt;

&lt;p&gt;GitHub ran an incident from 15:20 UTC to 16:39 UTC on June 10 2026, roughly 1 hour 19 minutes. GitHub described it as sporadic authentication failures impacting approximately 15% of API traffic, with erroneous 401 responses on API Requests; Issues was also briefly listed as degraded and was mitigated first. GitHub classified the incident as critical. The 15% figure is GitHub's own published characterization. The root cause was not disclosed; GitHub said only that it had identified a problematic component in its infrastructure and that a detailed analysis would be shared when available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Was the GitHub API down on June 10 2026?
&lt;/h3&gt;

&lt;p&gt;No, not fully. GitHub characterized it as sporadic authentication failures on approximately 15% of API traffic over a window of about 1 hour 19 minutes. By GitHub's own number, roughly 85% of API traffic was unaffected. That is a partial, sporadic degradation of authenticated requests, not a full API outage. GitHub itself did not use the word outage for this incident.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is an erroneous 401, and why doesn't a normal uptime check catch it?
&lt;/h3&gt;

&lt;p&gt;An erroneous 401 is an unauthorized response returned on a valid credential. The request was rejected as unauthenticated even though the token was correct, which is what GitHub reported on June 10 2026. A normal uptime check misses it for two reasons. An anonymous ping has no auth step to fail, so it never exercises the path that broke. And a status-only probe treats 401 as a clean, valid HTTP status rather than an error, so it stays green. You only catch this class of failure by asserting that the authenticated response actually came back: status 200, not 401, plus a field only an authorized response contains.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you monitor an authenticated API endpoint for auth failures?
&lt;/h3&gt;

&lt;p&gt;Use a multi-step API monitor. Step one authenticates, sending a valid test token or performing the auth step. Step two calls a protected endpoint with that token. Step three asserts the authenticated response came back: status 200 rather than 401, plus a content match on a field that only appears for an authenticated, authorized call. When valid credentials start drawing erroneous 401s, that third assertion flips the monitor red. Velprove's free plan covers the three-step shape, which is exactly enough for authenticate, call, assert.&lt;/p&gt;

&lt;h3&gt;
  
  
  Did Velprove detect the GitHub API authentication failures?
&lt;/h3&gt;

&lt;p&gt;No. Velprove did not monitor GitHub and did not detect this incident. This post uses it as a worked example of the failure shape an authenticated API monitor is built to catch, not as a claim that Velprove caught it. There is an extra honesty point specific to this incident: GitHub described the failures as sporadic and affecting approximately 15% of API traffic, so even the right monitor is not guaranteed to trip on any given poll during a partial, sporadic failure. The monitor demonstrates the shape is catchable; it does not guarantee it flagged this specific partial incident.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does GitHub have an SLA for an API authentication incident like this?
&lt;/h3&gt;

&lt;p&gt;GitHub's Free, Pro, and Team plans do not carry a contractual SLA, and a roughly 1 hour 19 minute partial degradation is unlikely to breach a typical 99.9 percent monthly threshold even where an SLA applies. The operational angle here is visibility, not the service credit: knowing that your integration started getting erroneous 401s and that the cause was upstream. For the deeper SLA-credit math, see &lt;a href="https://velprove.com/blog/sla-vs-slo-vs-sli-customer-guide" rel="noopener noreferrer"&gt;the SLA-vs-SLO-vs-SLI breakdown&lt;/a&gt;.&lt;/p&gt;
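
&lt;p&gt;The arithmetic behind that judgment, as a Python sketch, rests on two assumptions: a 30-day month, and an SLA that counts downtime as minutes weighted by the share of failed requests. Real SLAs define downtime in their own terms, and a provider's published terms, not this sketch, govern any credit. The function names are illustrative.&lt;/p&gt;

```python
def monthly_budget_minutes(target: float, days: int = 30) -> float:
    """Full-outage minutes an availability target allows in a month."""
    return (1 - target) * days * 24 * 60


def weighted_minutes(window_min: float, failing_share: float) -> float:
    """Window minutes weighted by the share of requests that failed."""
    return window_min * failing_share


budget = monthly_budget_minutes(0.999)    # about 43.2 minutes
full = 79.0                               # 15:20 to 16:39, counted as a full outage
weighted = weighted_minutes(full, 0.15)   # about 11.85 minutes at GitHub's ~15%
print(f"budget {budget:.1f} min, full {full:.0f} min, weighted {weighted:.2f} min")
```

&lt;p&gt;Counted as a full outage, the 79-minute window would exceed the 43.2-minute budget. Weighted by the reported share, it stays well inside it. Which reading applies depends on how a given SLA defines downtime.&lt;/p&gt;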

&lt;h2&gt;
  
  
  Monitor your authenticated API
&lt;/h2&gt;

&lt;p&gt;The trap is that an auth failure looks like the system working. A clean, valid 401 sails through a status-only probe, and an anonymous ping never touches the auth plane at all, so the one failure that locks out every holder of a valid token is the one your cheapest checks are built to miss. The fix is a monitor that authenticates, calls a protected endpoint, and asserts the authenticated response came back, a 200 and not a 401, plus a field only an authorized call returns.&lt;/p&gt;

&lt;p&gt;Velprove's free multi-step API monitor covers exactly this shape: up to three steps, which is all it takes to authenticate, call, and assert; ten monitors; five regions on every plan; and commercial use allowed, no credit card required. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Try Velprove&lt;/a&gt; and point a multi-step monitor at the authenticated API your product depends on. If your stack is API-shaped, start with &lt;a href="https://velprove.com/for/api" rel="noopener noreferrer"&gt;Velprove for API monitoring&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>webdev</category>
      <category>devops</category>
      <category>uptime</category>
    </item>
    <item>
      <title>Monitor a Mobile App Backend: The API Your iOS and Android App Calls</title>
      <dc:creator>velprove</dc:creator>
      <pubDate>Wed, 10 Jun 2026 14:00:03 +0000</pubDate>
      <link>https://dev.to/velprove/monitor-a-mobile-app-backend-the-api-your-ios-and-android-app-calls-5g5f</link>
      <guid>https://dev.to/velprove/monitor-a-mobile-app-backend-the-api-your-ios-and-android-app-calls-5g5f</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; To monitor a mobile app backend, point monitors at the exact endpoints your iOS or Android app calls (login and token, sync, push-token registration, version gate), not at your App Store listing or marketing site, because a native app is a thin client and those pages stay green through a total backend outage. Velprove does this with a free, no-code multi-step API monitor on those endpoints, plus its free no-code browser login monitor if you have a companion web sign-in at the same domain. The honest boundary: Velprove cannot drive your native app. It probes the endpoints externally from a region you pick, which is precisely the backend a crash reporter like Crashlytics goes quiet on when the app is fine but the server is down. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Start free&lt;/a&gt;, no credit card required.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Your App Store listing is green. Your backend is down. Every user sees a spinner.
&lt;/h2&gt;

&lt;p&gt;Your homepage returns 200. Your App Store listing is five stars. And every person who opens your app right now is staring at a spinner, because the one endpoint your app actually calls, the login and token endpoint, is down, and nothing on your status dashboard knows. The app binary on the phone is fine. It compiled, it shipped, it installed. What broke is the server it talks to, and the listing page that says everything is fine is served by Apple or Google, not by you.&lt;/p&gt;

&lt;p&gt;One honest thing up front, because it changes which monitor you reach for. Velprove's headline free monitor is the no-code browser login monitor, and it is the differentiator: it opens a real browser and signs into your own login the way a person does, with no code to write. But it drives a real web browser, not a native app. It cannot tap through your native iOS or Android UI. For a native app, the workhorse is the multi-step API monitor, which hits the same backend endpoints the app calls. The browser login monitor still earns its place here, but only if you also run a companion web sign-in at the same domain. If you do, layer it on top. If you do not, the API monitors below are the whole job.&lt;/p&gt;

&lt;p&gt;This is the failure that sends people looking for this post. The marketing site renders. The listing is green. Support is filling up with "the app just spins" and you have no signal that says why, because the only thing you were watching was a page Apple serves. Slack's own engineering post-mortem of its February 22, 2022 incident describes exactly this shape: the incident began just after 6 a.m. Pacific Time and, in their words, &lt;a href="https://slack.engineering/slacks-incident-on-2-22-22/" rel="noopener noreferrer"&gt;many users were unable to connect to Slack&lt;/a&gt;. The app on every phone was untouched. The backend behind it was overwhelmed. That is the thin-client truth, and it is the whole reason this post exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. A native app is a thin client. The product is the API behind it.
&lt;/h2&gt;

&lt;p&gt;Walk through what your app actually does on launch. It calls a login or token endpoint. It calls a sync or feed endpoint to pull the user's data. It POSTs its push-notification device token to a registration endpoint. It calls a version-check endpoint to decide whether the installed build is still allowed. Every one of those is an HTTP call to your backend. The app itself is a renderer. The product, the part that can actually go dark, lives on your server.&lt;/p&gt;

&lt;p&gt;Now look at what a basic uptime check watches. A homepage &lt;code&gt;GET / → 200&lt;/code&gt; tells you the marketing site renders. It does not exercise the auth endpoint, the sync endpoint, the push-registration endpoint, or the version-gate endpoint, which are the endpoints the app calls. App Store and Play Store listing pages are served by Apple and Google, not by your backend, so they stay green through a total backend outage. There is no Apple or Google surface that shows you your backend is down, because your backend is not their service. The absence of that signal is the point.&lt;/p&gt;

&lt;p&gt;This is the same structural problem behind every silent outage: an internal or adjacent signal looks green while the real path users take is broken. We have written about &lt;a href="https://velprove.com/blog/why-uptime-monitors-miss-outages" rel="noopener noreferrer"&gt;why a 200 OK is not the product&lt;/a&gt; in general. The native-app case is one of the sharpest instances of it, because the green you are looking at is not even your server. It is a listing page Apple or Google renders, and it has no idea your API is failing.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. This is not crash reporting. Crashlytics, Sentry, and Datadog RUM watch a different thing.
&lt;/h2&gt;

&lt;p&gt;Before the failure modes, draw the line clearly, because "mobile monitoring" means two different things and blurring them wastes money. Firebase Crashlytics is, in Firebase's own words, a lightweight, realtime crash reporter that helps you track, prioritize, and fix stability issues, and on Android it reports crashes, non-fatal errors, and Application Not Responding (ANR) errors. Sentry and Datadog RUM sit in the same category: they instrument the client, capture crashes and on-device performance, and send that telemetry up from the user's phone.&lt;/p&gt;

&lt;p&gt;Velprove is not that, and it is not trying to be. It runs no code on the user's device. It does not embed in your app, it does not capture crashes, and it does not watch on-device performance. It probes your backend endpoints from outside, from a region you pick. That is a different and complementary concern. Here is the case that makes the difference concrete: a backend that starts returning clean HTTP 503s causes spinners and login failures for every user, with zero crashes. The app handled the error gracefully. So a crash reporter stays completely quiet, because nothing crashed, while your entire install base is locked out.&lt;/p&gt;

&lt;p&gt;So run both. Crashlytics or Sentry catch crashes inside the app and go quiet when the app is fine but your backend is down. Velprove watches the backend the app depends on and turns red exactly when those clean 503s start. They cover opposite halves of the same product. Neither replaces the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Failure mode 1: the long-lived token refresh that logs out your whole install base.
&lt;/h2&gt;

&lt;p&gt;A native app does not ask the user to type a password every launch. It stores a long-lived token, and on launch it quietly calls a refresh or token endpoint to trade that for a fresh access token. When the refresh endpoint breaks, the app cannot get a valid token, so it falls back to the only thing it can do: it logs the user out and shows the sign-in screen. Now your entire install base is staring at a login screen at once, while your marketing site and any web dashboard stay green, because neither of them exercises that endpoint.&lt;/p&gt;

&lt;p&gt;This is the case the multi-step API monitor was built for, because a single status ping on the login endpoint is not enough. A login endpoint can return 200 and still hand back a token that does not actually authorize anything. You need to use the token. In prose, the monitor does two steps. Step 1: POST your login or token endpoint with a dedicated test account's credentials, assert the status is 200, and capture the returned token into a variable. Step 2: GET a protected endpoint, the kind the app calls right after sign-in, send that captured token as a Bearer header, and assert the response contains that test user's own data. Now a pass means the full chain works: the token was issued and it actually unlocked protected data. A break anywhere in that chain turns the monitor red.&lt;/p&gt;
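&lt;p&gt;To make the capture-and-replay idea concrete, here is a tiny step runner in Python. It is a sketch, not Velprove's configuration format. The URLs, the &lt;code&gt;access_token&lt;/code&gt; field, the test account, and the &lt;code&gt;{{token}}&lt;/code&gt; placeholder syntax are all hypothetical. &lt;code&gt;send&lt;/code&gt; is any callable you supply that performs the HTTP request and returns &lt;code&gt;(status, body_text)&lt;/code&gt;.&lt;/p&gt;

```python
import json
import re

# Hypothetical two-step chain: sign in, then replay the captured token.
STEPS = [
    {
        "method": "POST",
        "url": "https://api.example.com/auth/token",
        "headers": {"Content-Type": "application/json"},
        # Placeholder credentials for a dedicated test account; keep real secrets out of source.
        "body": json.dumps({"email": "monitor@example.com", "password": "TEST_PASSWORD"}),
        "expect_status": 200,
        "capture": {"token": "access_token"},  # variable name: JSON field to read
    },
    {
        "method": "GET",
        "url": "https://api.example.com/v1/me/feed",
        "headers": {"Authorization": "Bearer {{token}}"},
        "body": None,
        "expect_status": 200,
        "expect_in_body": "monitor@example.com",  # the test user's own data
    },
]


def fill(template, variables):
    # Replace {{name}} placeholders with values captured by earlier steps.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: variables[m.group(1)], template)


def run_steps(steps, send):
    variables = {}
    for i, step in enumerate(steps, start=1):
        headers = {k: fill(v, variables) for k, v in step["headers"].items()}
        status, text = send(step["method"], fill(step["url"], variables), headers, step["body"])
        if status != step["expect_status"]:
            return f"FAIL step {i}: status {status}"
        if "expect_in_body" in step and step["expect_in_body"] not in text:
            return f"FAIL step {i}: expected content missing"
        captures = step.get("capture", {})
        if captures:
            try:
                data = json.loads(text)
            except ValueError:
                return f"FAIL step {i}: body is not JSON, nothing to capture"
            for name, field_name in captures.items():
                value = data.get(field_name) if isinstance(data, dict) else None
                if not value:
                    return f"FAIL step {i}: no {field_name!r} to capture"
                variables[name] = value
    return "PASS"
```

&lt;p&gt;A pass requires every link to hold: the token was issued, captured, and actually unlocked the protected data. A 200 from step one with a useless token still fails at step two.&lt;/p&gt;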

&lt;p&gt;The mechanics of chaining one step into the next, extracting the token and replaying it, are covered in depth in &lt;a href="https://velprove.com/blog/multi-step-api-monitoring-guide" rel="noopener noreferrer"&gt;the multi-step API monitor mechanism&lt;/a&gt; guide. This post is about pointing that mechanism at the specific endpoints a native app depends on, so go there for the step-by-step build and stay here for which endpoints matter and why.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Failure mode 2: the push-token registration endpoint (APNs / FCM) that dies silently.
&lt;/h2&gt;

&lt;p&gt;Push notifications have a leg that runs through your backend, and it is the quietest failure on this list. The flow is: the app registers with Apple Push Notification service or Firebase Cloud Messaging, the platform returns a device token, and then the app forwards that token to your server. Apple documents the app forwarding its token to the app's associated provider. Firebase is blunter: it tells you to retrieve registration tokens and store them on your server, and strongly recommends the app save the token to your app server alongside a timestamp at startup. That "send the token to your backend" leg is an endpoint you own.&lt;/p&gt;

&lt;p&gt;When that registration endpoint starts returning 500s, nothing visible breaks. The app launches. The user scrolls. No crash, no error dialog, nothing in a crash reporter, because the app did its part and the registration POST just failed silently in the background. But new installs and token refreshes stop landing in your database, so those users quietly fall off your push list. You find out days later when a campaign goes out and the delivered count is wrong, by which point you have lost the window.&lt;/p&gt;

&lt;p&gt;Monitor it directly. Point an API monitor at your registration endpoint, have it POST a sentinel device token the way the app would, assert the status is 200, and add a body assertion on the confirmation your endpoint returns when it has stored the token. Now the moment the endpoint starts 500-ing, the monitor goes red, instead of you discovering it from a low delivery count two campaigns later.&lt;/p&gt;
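&lt;p&gt;Here is a minimal sketch of that registration check in Python, using only the standard library. The URL, the sentinel payload, and the &lt;code&gt;registered&lt;/code&gt; confirmation field are hypothetical. Mirror whatever your app really sends and whatever your endpoint really returns. Ideally, have the backend recognize the sentinel token so it is never used for a real send.&lt;/p&gt;

```python
import json
import urllib.error
import urllib.request

# Hypothetical endpoint and payload: mirror whatever your app really POSTs.
REGISTER_URL = "https://api.example.com/v1/devices/push-token"
SENTINEL = {"platform": "ios", "token": "monitor-sentinel-token"}


def build_request(url, payload):
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=data, method="POST", headers={"Content-Type": "application/json"}
    )


def evaluate(status, body_text):
    # Status assertion first, then a body assertion on the "stored it" confirmation.
    if status != 200:
        return f"FAIL: status {status}"
    try:
        body = json.loads(body_text)
    except ValueError:
        return "FAIL: body is not JSON"
    if not isinstance(body, dict) or body.get("registered") is not True:
        return "FAIL: no registration confirmation in body"
    return "PASS"


def probe(url=REGISTER_URL, timeout=10):
    try:
        with urllib.request.urlopen(build_request(url, SENTINEL), timeout=timeout) as resp:
            return evaluate(resp.status, resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; a 500 here is exactly the silent failure.
        return evaluate(err.code, "")
    except urllib.error.URLError as err:
        return f"FAIL: could not connect ({err.reason})"
```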

&lt;h2&gt;
  
  
  6. Failure mode 3: the force-update / min-version gate that can lock out everyone.
&lt;/h2&gt;

&lt;p&gt;There is no native API to force users to update on iOS. Apps implement the force-update gate themselves: on launch the app calls a backend version-check endpoint, or the public iTunes Search API, gets back a minimum-allowed version, and if the installed build is too old it shows a blocking "please update" screen. Open-source helpers like Siren do the comparison on the client, and Firebase Remote Config is sometimes used to flip a flag, with Firebase noting it lets you change behavior without publishing an app update. However you wire it, the decision hinges on a response from an endpoint you own.&lt;/p&gt;
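&lt;p&gt;On the client side, that gate usually reduces to a version comparison plus a policy for an unusable response. Here is a minimal Python sketch. It does not reproduce Siren's logic or that of any other library. The &lt;code&gt;min_version&lt;/code&gt; field name is hypothetical, and the &lt;code&gt;on_bad_response&lt;/code&gt; switch exists only to show the two policies side by side: the hard gate and the fail-open.&lt;/p&gt;

```python
import operator


def parse_version(text):
    """'2.14.1' becomes (2, 14, 1); anything else raises ValueError."""
    parts = text.strip().split(".")
    if not all(p.isdecimal() for p in parts):
        raise ValueError(f"not a version: {text!r}")
    return tuple(int(p) for p in parts)


def gate(installed, response, on_bad_response):
    """Decide what the app shows on launch.

    response: parsed JSON from the version-check endpoint, or None if the call failed.
    on_bad_response: "block" (hard gate) or "allow" (fail open) when the response is unusable.
    Assumes both versions use the same number of components.
    """
    try:
        minimum = parse_version(response["min_version"])
    except (TypeError, KeyError, AttributeError, ValueError):
        # Garbage, a 500 with no usable body, or a missing field: the policy decides.
        return "update_required" if on_bad_response == "block" else "allowed"
    if operator.lt(parse_version(installed), minimum):
        return "update_required"
    return "allowed"
```

&lt;p&gt;With a healthy response, only genuinely old builds are blocked. With a broken one, the outcome is decided entirely by a policy most teams never think about, which is why the endpoint deserves its own monitor.&lt;/p&gt;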

&lt;p&gt;That makes the version-check endpoint a backend dependency with an unusually nasty blast radius, because it gates launch itself. If it returns garbage or 500s, you get one of two bad outcomes. The app may hard-gate every user out, treating a broken response as "you must update" and showing the blocking screen to your entire install base, including people on the current version. Or it fails open, lets every client through regardless of version, and now incompatible old builds are hitting an API you have already changed. Either way, an endpoint nobody thinks about controls whether anyone can use the app.&lt;/p&gt;

&lt;p&gt;Monitor it like the load-bearing dependency it is. Put an API monitor on the version-check endpoint, assert the status is 200, and add a body assertion on the expected minimum-version payload shape, so a malformed or empty response trips the monitor before it trips your users. The mechanics of asserting on a JSON body, pulling a specific field and checking its shape, are covered in &lt;a href="https://velprove.com/blog/monitor-rest-api-health-endpoint" rel="noopener noreferrer"&gt;the guide on asserting on a health endpoint's JSON body&lt;/a&gt;. Apply the same body-assertion technique to the version-gate response.&lt;/p&gt;
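&lt;p&gt;On the monitor side, the body assertion only needs to reject anything that does not match the expected shape. Here is a minimal Python sketch, assuming a hypothetical payload of the form &lt;code&gt;{"min_version": "2.10.0"}&lt;/code&gt;. Adapt the field name and the version pattern to what your endpoint actually returns.&lt;/p&gt;

```python
import json
import re

# Hypothetical payload shape: {"min_version": "2.10.0"}. Adjust to your endpoint.
VERSION_RE = re.compile(r"\d+\.\d+\.\d+")


def check_version_gate(status, body_text):
    """Pass only on a 200 whose JSON body carries a well-formed min_version."""
    if status != 200:
        return f"FAIL: status {status}"
    try:
        body = json.loads(body_text)
    except ValueError:
        return "FAIL: body is not JSON"
    if not isinstance(body, dict):
        return "FAIL: body is not a JSON object"
    value = body.get("min_version")
    if not isinstance(value, str) or not VERSION_RE.fullmatch(value):
        return f"FAIL: min_version missing or malformed: {value!r}"
    return "PASS"
```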

&lt;h2&gt;
  
  
  7. Failure mode 4: cert pinning, where the endpoint is green and the app is bricked.
&lt;/h2&gt;

&lt;p&gt;Certificate pinning is the trap that breaks a perfectly healthy endpoint for every pinned client at once. A pinned app trusts only a specific certificate or key, not any valid CA. When you rotate the certificate, the new one does not match the pin baked into the shipped app, so the app fails TLS and cannot connect, while every browser and every generic HTTPS monitor sees a valid cert and stays green. Android's own docs warn that future server configuration changes, such as changing to another CA, render apps with pinned certificates unable to connect to the server without receiving a client software update. Cloudflare puts it the same way: pinning carries outage risk, and clients pinned to the old key will fail TLS each time a certificate rotates.&lt;/p&gt;

&lt;p&gt;This is the failure where you have to be honest about what an external monitor can and cannot do, because it is easy to overclaim here. Velprove's SSL certificate monitoring, toggled on an HTTPS monitor, reads the live leaf certificate and warns you as expiry approaches. And the HTTPS monitor's own TLS handshake fails the moment a rotated certificate stops validating, which is your cue to release the updated pin in a new app build before the rotation strands clients. What it cannot do is reach inside your installed app and confirm the app's pin set was actually updated. Only releasing and testing the client build proves that. So treat it as a rotation tripwire, not as proof your pinned clients are fine.&lt;/p&gt;

&lt;p&gt;The practical posture: if you pin, SSL certificate monitoring on that HTTPS monitor earns its slot as an early warning that a rotation is about to (or did) happen, so you can get an updated build out. That is genuinely useful and worth doing. Just do not read a green certificate result as "the pinned app connects." Those are two different claims, and the gap between them is exactly where pinned-app outages live.&lt;/p&gt;
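&lt;p&gt;If you pin, one check you can run yourself before and after a rotation is to compare the live leaf certificate against the fingerprints you believe the shipped build pins. Here is a minimal Python sketch using the standard library's &lt;code&gt;ssl&lt;/code&gt; and &lt;code&gt;hashlib&lt;/code&gt;. This is not a Velprove feature, and the pin set is a hypothetical input you maintain. The sketch fingerprints the whole certificate. Many apps instead pin a public-key (SPKI) hash, which requires a certificate parser such as the third-party &lt;code&gt;cryptography&lt;/code&gt; package. Like the HTTPS monitor, this check still cannot see the pin set inside apps already installed on phones.&lt;/p&gt;

```python
import hashlib
import ssl


def cert_sha256(pem):
    """SHA-256 fingerprint of a PEM certificate's DER bytes, as lowercase hex."""
    return hashlib.sha256(ssl.PEM_cert_to_DER_cert(pem)).hexdigest()


def live_leaf_pem(host, port=443):
    # The certificate the server presents right now (fetched without validating it).
    return ssl.get_server_certificate((host, port))


def pin_status(live_pem, shipped_pins):
    """Compare the live leaf with the fingerprints you believe the shipped build pins."""
    fingerprint = cert_sha256(live_pem)
    if fingerprint in shipped_pins:
        return "MATCH"
    return f"MISMATCH: live leaf {fingerprint[:16]}... is not in the shipped pin set"


# Usage (needs network): pin_status(live_leaf_pem("api.example.com"), SHIPPED_PINS)
```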

&lt;h2&gt;
  
  
  8. Which monitor for which endpoint: a decision section.
&lt;/h2&gt;

&lt;p&gt;You have four failure modes, and three monitor types do the work here. Here is the map, so you are not guessing which tool to point at which endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-step API monitor&lt;/strong&gt; for anything that needs a token to prove it works: the auth and token-refresh flow, and any protected endpoint the app calls right after sign-in. Sign in, capture the token, replay it against a protected endpoint, assert on the user's own data. This is the one that catches the logged-out-install-base failure, because it proves the token actually authorizes something.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single API monitor&lt;/strong&gt; for endpoints you can check with one request: the push-token registration endpoint, the version-check gate, and any plain health endpoint. Status assertion plus a body assertion on the response the endpoint returns. These are the silent-failure endpoints, the ones that 500 without anyone noticing, so they are exactly the ones worth a standing monitor. The body-assertion mechanics carry over from &lt;a href="https://velprove.com/blog/api-health-check-patterns" rel="noopener noreferrer"&gt;the API health-check patterns&lt;/a&gt; guide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browser login monitor&lt;/strong&gt; only if there is a companion web sign-in at the same domain. It opens a real browser and signs in the way a person does, no code to write, which is the right tool for the human web path but the wrong tool for a native app, because it drives a browser, not your iOS or Android UI. If you do run a web login alongside the app, &lt;a href="https://velprove.com/blog/monitor-saas-login-page" rel="noopener noreferrer"&gt;layer the browser login monitor on the web sign-in&lt;/a&gt; and keep the API monitors on the native app's endpoints.&lt;/p&gt;
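&lt;p&gt;Here is the same map written as a tiny decision function over a hypothetical endpoint inventory. The entries and the &lt;code&gt;needs_token&lt;/code&gt; / &lt;code&gt;web_signin&lt;/code&gt; flags are illustrative assumptions, not a Velprove setting. The point is that the choice of monitor follows from two questions. Does proving the endpoint works require a token? And is this the human web sign-in?&lt;/p&gt;

```python
# Hypothetical inventory of the endpoints a native app calls.
ENDPOINTS = [
    {"name": "token refresh", "needs_token": True, "web_signin": False},
    {"name": "feed after sign-in", "needs_token": True, "web_signin": False},
    {"name": "push-token registration", "needs_token": False, "web_signin": False},
    {"name": "version gate", "needs_token": False, "web_signin": False},
    {"name": "companion web sign-in", "needs_token": False, "web_signin": True},
]


def monitor_for(endpoint):
    # The human web path gets a real browser; token-gated calls need a chain.
    if endpoint["web_signin"]:
        return "browser login monitor"
    if endpoint["needs_token"]:
        return "multi-step API monitor"
    return "single API monitor"


for ep in ENDPOINTS:
    print(f"{ep['name']}: {monitor_for(ep)}")
```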

&lt;p&gt;One rule that trips people up: a monitor runs from one region you pick. There is no "run this from all five regions" toggle on a single monitor. If you want coverage from multiple regions, you create one monitor per region against the same endpoint. All five global regions are available on every plan, including free. So to watch your token endpoint from, say, North America and Europe, that is two monitors on the same URL, one per region, and a regional failure shows up as one of them going red while the other stays green, which is how you tell a regional problem from a global one.&lt;/p&gt;
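&lt;p&gt;Because each monitor runs from one region, multi-region coverage comes down to a set of near-identical monitors plus a rule for reading them together. Here is a minimal Python sketch. The region labels and the monitor dictionary shape are hypothetical stand-ins, not Velprove's API.&lt;/p&gt;

```python
# Hypothetical region labels; use whichever regions your users come from.
REGIONS = ["north-america", "europe"]


def monitors_for(url, regions):
    # A monitor runs from one region, so coverage means one monitor per region.
    return [{"url": url, "region": region} for region in regions]


def classify(results):
    """results maps region to True (green) or False (red) for the same URL."""
    red = sorted(region for region, ok in results.items() if not ok)
    if not red:
        return "healthy"
    if len(red) == len(results):
        return "global: every region is red"
    return "regional: red in " + ", ".join(red)


print(classify({"north-america": True, "europe": False}))  # regional: red in europe
```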

&lt;p&gt;And to be clear about ownership: this post is about your own backend behind a native client. If the API your app depends on is a third party you do not control, like an LLM provider, the triage is different, and &lt;a href="https://velprove.com/blog/monitor-ai-app-when-llm-provider-degrades" rel="noopener noreferrer"&gt;the same pattern when the API your app calls is a third-party provider&lt;/a&gt; covers that case. Here we are watching the endpoints you wrote.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Can Velprove monitor APNs or FCM push notifications?
&lt;/h3&gt;

&lt;p&gt;Not push delivery itself, and it runs no code on the device. What it monitors is your own backend's token-registration endpoint, the leg where the app POSTs its device token to your server. Firebase's own docs tell you to store those tokens on your server, and Apple's APNs flow has the app forward its token to your provider. If that registration endpoint starts returning 500s, new installs and token refreshes silently stop registering, and those users simply never get pushes, with no crash and nothing the user sees. A Velprove API monitor POSTs a sentinel token to that endpoint and asserts on the success response, so you find out the moment it breaks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can Velprove test my app's force-update gate?
&lt;/h3&gt;

&lt;p&gt;Yes, by monitoring the backend version-check endpoint your app calls on launch to decide whether the installed client version is still allowed. Velprove cannot trigger the native update UI, because there is no native force-update primitive on iOS. The gate is something your app implements against that endpoint. So Velprove monitors the endpoint: it asserts the status is 200 and that the JSON body still carries the expected minimum-version shape, so a broken or garbage version-check response turns the monitor red before it locks out or fails open for your whole install base.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Velprove sign into my actual iOS or Android app?
&lt;/h3&gt;

&lt;p&gt;No. The browser login monitor drives a real web browser, not a native app, so it cannot tap through your native iOS or Android UI. Coverage for a native app's backend is the multi-step API monitor hitting the same HTTP endpoints the app calls: it signs in against your login or token endpoint, captures the token, and replays it against a protected endpoint. The browser login monitor still applies if you have a companion web sign-in at the same domain, layered on top of the API monitors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Velprove a crash-reporting tool like Crashlytics or Sentry?
&lt;/h3&gt;

&lt;p&gt;No, and it is complementary, not a replacement. Crashlytics and Sentry are on-device crash and ANR reporters: they catch crashes inside the app and go quiet when the app is fine but your backend is down. A backend that returns clean HTTP 503s causes spinners and login failures with zero crashes, which is invisible to a crash reporter. Velprove runs no on-device code. It probes the backend endpoints externally from a region you pick, watching the server the app depends on. Run both: the crash reporter for client-side stability, Velprove for backend uptime.&lt;/p&gt;

&lt;h3&gt;
  
  
  Will a generic HTTPS monitor catch a certificate-pinning failure?
&lt;/h3&gt;

&lt;p&gt;Not the pin mismatch itself. A browser-based or CA-trust monitor stays green while a pinned app fails TLS, because the monitor trusts any valid CA while the app trusts only its old pinned key. Velprove's SSL certificate monitoring, toggled on an HTTPS monitor, reads the live leaf certificate and warns as expiry approaches, and the HTTPS monitor's TLS handshake fails the moment a rotated certificate stops validating, which is your cue to release the updated pin in a new app build. What it cannot do is confirm the app's pin set was actually updated. That confirmation only comes from releasing and testing the client build.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I monitor my mobile app backend on the free plan?
&lt;/h3&gt;

&lt;p&gt;Yes. Free includes 10 monitors at 5-minute intervals, multi-step API monitors up to 3 steps, one no-code browser login monitor every 15 minutes, all six assertion types, email alerts, and commercial use allowed. A monitor runs from one region you pick. To cover more regions, create one monitor per region against the same endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitor your mobile app backend
&lt;/h2&gt;

&lt;p&gt;The trap is the green that is not yours: the App Store listing, the marketing homepage, the crash reporter that stays quiet because nothing crashed, all reporting fine while your backend hands every user a spinner. The fix is to monitor the endpoints the app actually calls. A multi-step API monitor on the auth and token-refresh flow that captures a token and replays it against protected data. Single API monitors on the push-token registration endpoint and the version-check gate, with body assertions, because those fail silently. SSL certificate monitoring on the HTTPS monitor as a cert-rotation tripwire if you pin. And the no-code browser login monitor on your companion web sign-in if you have one.&lt;/p&gt;

&lt;p&gt;Velprove's free plan covers all of it: 10 monitors at 5-minute intervals, multi-step API monitors up to 3 steps, one no-code browser login monitor, all six assertion types, email alerts, commercial use allowed, and no credit card. Pick a region per monitor, and add a second region on the endpoints that matter most. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Sign up free&lt;/a&gt; and point a monitor at the API your iOS and Android app actually calls.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>webdev</category>
      <category>devops</category>
      <category>uptime</category>
    </item>
    <item>
      <title>GitHub Outage, June 2026: Signed-Out Users Lost Pull Requests and Issues</title>
      <dc:creator>velprove</dc:creator>
      <pubDate>Tue, 09 Jun 2026 14:00:05 +0000</pubDate>
      <link>https://dev.to/velprove/github-outage-june-2026-signed-out-users-lost-pull-requests-and-issues-4nk6</link>
      <guid>https://dev.to/velprove/github-outage-june-2026-signed-out-users-lost-pull-requests-and-issues-4nk6</guid>
      <description>&lt;p&gt;&lt;strong&gt;The blind spot:&lt;/strong&gt; on June 8 2026, GitHub's Pull Requests, Issues, and Actions went down for signed-out visitors while logged-in users saw nothing wrong, an incident GitHub ran from 07:11 to 08:36 UTC. The structural problem is that internal dashboards and logged-in QA are constitutionally blind to a signed-out outage, because everyone doing the checking is signed in. The only instrument that sees this class of failure is an external, unauthenticated monitor that hits the public URL with no auth cookie and asserts on the content a signed-out visitor is supposed to get. That is the monitor Velprove gives you free: an unauthenticated HTTP monitor with a content assertion, so green means the public page actually rendered. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Start free&lt;/a&gt;, no credit card required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the people checking couldn't see it
&lt;/h2&gt;

&lt;p&gt;A signed-out outage is the one failure shape your own team is structurally unequipped to notice, because the act of checking removes you from the affected population. Every engineer dogfooding the product is signed in. Every internal dashboard queries from an authenticated service account. Every QA pass runs against a logged-in session. So when the break lives specifically on the anonymous, signed-out render path, the people best positioned to catch it are the people least able to see it. They are all on the wrong side of the login.&lt;/p&gt;

&lt;p&gt;That is exactly what happened to GitHub on June 8 2026. An engineer with an active session could open a repo's Issues tab or a Pull Request and see it render normally. The internal view was green because the internal view is, by definition, authenticated. Meanwhile the experience that broke was the one nobody on the inside was looking at: the first-time visitor with no account, the search crawler, the anonymous API caller, the developer who clicked a link to a public PR without logging in first.&lt;/p&gt;

&lt;p&gt;This is the same structural problem behind every silent outage. An internal signal looks healthy while the external reality is broken. We have written up the broader catalog in &lt;a href="https://velprove.com/blog/anatomy-of-a-silent-outage" rel="noopener noreferrer"&gt;the silent-outage taxonomy&lt;/a&gt;. The signed-out variant is one of the cleanest cases in that catalog, because the green you see is real. Your authenticated path genuinely works. It is just not the path that broke.&lt;/p&gt;

&lt;h2&gt;
  
  
  The timeline (UTC, primary-source)
&lt;/h2&gt;

&lt;p&gt;The timeline below is taken verbatim from the GitHub Status incident permalink. All times UTC. Source: &lt;a href="https://www.githubstatus.com/incidents/m7n7sm0sr1pz" rel="noopener noreferrer"&gt;GitHub Status incident m7n7sm0sr1pz&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time (UTC)&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;07:11&lt;/td&gt;
&lt;td&gt;Investigating. "We are investigating reports of impacted performance for some GitHub services."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;07:14&lt;/td&gt;
&lt;td&gt;Investigating. "Issues is experiencing degraded availability. We are continuing to investigate."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;07:31&lt;/td&gt;
&lt;td&gt;Investigating. "Issues is experiencing degraded performance. We are continuing to investigate."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;07:32&lt;/td&gt;
&lt;td&gt;Investigating. "Pull Requests is experiencing degraded performance. We are continuing to investigate."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;08:13&lt;/td&gt;
&lt;td&gt;Investigating. "Following investigation, we are seeing that impact is limited to unauthenticated users when accessing Pull Requests, Issues, or Actions."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;08:27&lt;/td&gt;
&lt;td&gt;Investigating. "Actions is experiencing degraded performance. We are continuing to investigate."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;08:35&lt;/td&gt;
&lt;td&gt;Monitoring. "The degradation affecting Actions, Issues and Pull Requests has been mitigated. We are monitoring to ensure stability."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;08:36&lt;/td&gt;
&lt;td&gt;Resolved. "This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GitHub classified the incident as critical. From the first status post to resolution, the incident ran 07:11 to 08:36 UTC, roughly 1 hour 25 minutes. The resolution update promised a root cause analysis when available; as of this writing none had been published, so this post does not speculate on cause.&lt;/p&gt;

&lt;h2&gt;
  
  
  What broke for signed-out visitors (and what didn't)
&lt;/h2&gt;

&lt;p&gt;Here is the scope, in GitHub's own framing and no further.&lt;/p&gt;

&lt;p&gt;What broke: Pull Requests, Issues, and Actions, for unauthenticated, signed-out users. The incident title named Pull Requests and Issues; the 08:13 and 08:35 updates explicitly added Actions. GitHub described the impact as "degraded availability" and "degraded performance," with the incident title using the word "unavailable."&lt;/p&gt;

&lt;p&gt;What GitHub did NOT claim was affected: authenticated web sessions. The 08:13 update is explicit that impact was "limited to unauthenticated users." We state that as GitHub's characterization of the blast radius, not as an independently verified absolute, because we did not measure it ourselves. But it is the published scope, and it is the whole reason this incident is a clean case study in signed-out blindness.&lt;/p&gt;

&lt;p&gt;A few things this post deliberately does not assert, because GitHub did not publish them. There is no error rate or percentage of failed requests; GitHub did not quantify the blast radius. There is no claim about git operations, clone, push, or pull, because GitHub did not mention them. And there is no root cause, because none was disclosed. The supportable story is narrow on purpose: a critical incident, scoped by GitHub to the signed-out public surface, on three named features, for about 85 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ~62-minute scoping delta
&lt;/h2&gt;

&lt;p&gt;There is one supportable number worth pulling out, and it has to be stated carefully so it does not get read as something it is not.&lt;/p&gt;

&lt;p&gt;GitHub opened the incident at 07:11 UTC with a generic line: "investigating reports of impacted performance for some GitHub services." It was not until 08:13 UTC, roughly 62 minutes later, that the public update first identified who was actually affected: "impact is limited to unauthenticated users." For that first hour, the public updates spoke only of "some GitHub services" or named a degraded feature (Issues, then Pull Requests), without saying that the blast radius was the anonymous public surface.&lt;/p&gt;

&lt;p&gt;Be precise about what this delta is and is not. It is a scoping delta: the gap between acknowledging a problem and characterizing who it hit. It is NOT a detection-delay number. GitHub's feed reports the impact start time as equal to the acknowledgment time, so there is no GitHub-stated earlier impact-start to subtract, and this post does not invent one. We are not claiming the outage was silently broken for some measured number of minutes before GitHub noticed. We are pointing at the 62 minutes during which the public-facing scoping was generic, which is exactly the window where an external signal that already knew "the signed-out path is broken" would have been most useful to anyone trying to triage their own reports.&lt;/p&gt;
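&lt;p&gt;Both figures in this post are plain subtraction over the status-page timestamps, so they are easy to check. Here is a small Python check using only the standard library. It assumes all three timestamps fall on the same UTC day, as they do in the timeline.&lt;/p&gt;

```python
from datetime import datetime


def minutes_between(start, end, fmt="%H:%M"):
    # Same-day UTC timestamps from the status-page timeline.
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


print("incident window:", minutes_between("07:11", "08:36"), "minutes")  # 85
print("scoping delta:", minutes_between("07:11", "08:13"), "minutes")    # 62
```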

&lt;h2&gt;
  
  
  How Velprove monitors public/unauthenticated access
&lt;/h2&gt;

&lt;p&gt;This is the solution, and it follows directly from the blind spot: if your own checking is blind because it is authenticated, the fix is a monitor that is deliberately unauthenticated.&lt;/p&gt;

&lt;p&gt;The instrument is an external HTTP and content monitor that requests a public URL with no session cookie, the same as a first-time anonymous visitor, and asserts on the content the signed-out page is supposed to render. The public surface to probe is whatever a signed-out user actually hits: a public repository's Issues or Pull Requests page, or an anonymous endpoint on &lt;code&gt;api.github.com&lt;/code&gt; that returns public data without a token. Point the monitor there as an anonymous client, exactly as the public reaches it.&lt;/p&gt;

&lt;p&gt;The load-bearing detail is what you assert on. A status code alone is not enough. During a signed-out break, the anonymous path can still return a 200 from a fallback render, an error page, or a partial shell, and a status-only probe would stay green through the whole incident. So you assert on two things together: a 200 status, and a content match on a string or element that only the real public page renders for an anonymous visitor. For a public Issues page that might be the issue-list heading or a known issue title that always appears for the signed-out view; for an anonymous API endpoint it is a field that the public JSON always contains. When the signed-out experience breaks, the content assertion fails even if the status code does not, and the monitor turns red while a logged-in session would still look fine.&lt;/p&gt;
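&lt;p&gt;The two-part assertion can be sketched in a few lines of Python. This is an illustration of the idea, not Velprove's implementation. &lt;code&gt;evaluate&lt;/code&gt; and &lt;code&gt;probe_anonymously&lt;/code&gt; are hypothetical names, and the URL and marker string are yours to choose. The key design choice is that the request carries no cookie jar and no &lt;code&gt;Authorization&lt;/code&gt; header, so it sees what a first-time anonymous visitor sees. A 200 without the marker is treated as down.&lt;/p&gt;

```python
import urllib.error
import urllib.request


def evaluate(status: int, body: str, marker: str) -> str:
    """Return "up" only when the status is 200 AND the expected marker text is present."""
    if status != 200:
        return "down: status " + str(status)
    if marker not in body:
        # A 200 from a fallback render, error page, or partial shell lands here.
        return "down: 200 but content marker missing"
    return "up"


def probe_anonymously(url: str, marker: str, timeout: float = 10.0) -> str:
    """Fetch url as a first-time visitor: no session cookie, no Authorization header."""
    # build_opener() with no HTTPCookieProcessor keeps no cookies between requests.
    opener = urllib.request.build_opener()
    request = urllib.request.Request(url, headers={"User-Agent": "anonymous-probe/0.1"})
    try:
        with opener.open(request, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return evaluate(response.status, body, marker)
    except urllib.error.HTTPError as err:  # non-2xx responses arrive as exceptions
        return evaluate(err.code, "", marker)
    except urllib.error.URLError as err:   # DNS, TCP, or TLS failure before any response
        return "down: " + str(err.reason)
```

&lt;p&gt;For example, &lt;code&gt;probe_anonymously("https://example.com/public/issues", "Issues")&lt;/code&gt; uses a placeholder URL and marker. It returns &lt;code&gt;"up"&lt;/code&gt; only when the anonymous render actually contains the marker. A fallback shell that still answers 200 comes back as down.&lt;/p&gt;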

&lt;p&gt;For the broader reason a plain 200 OK probe misses this entire class of failure, see &lt;a href="https://velprove.com/blog/why-uptime-monitors-miss-outages" rel="noopener noreferrer"&gt;why HTTP probes miss this class of failure&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One note on vantage point: each monitor runs from a single region you pick. GitHub characterized the June 8 incident as a feature-scoped impact rather than a regional one, so a single unauthenticated monitor from any one region would have been watching the right surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honesty boundary
&lt;/h2&gt;

&lt;p&gt;The strongest version of this post is the version that names what it does not claim.&lt;/p&gt;

&lt;p&gt;First, and most important: Velprove did not detect this incident. We did not monitor GitHub. Everything above is the failure shape that an external unauthenticated monitor is built to catch, presented as a worked example, not as a claim that we caught it. If you read a detection into this post, that is more than the facts support.&lt;/p&gt;

&lt;p&gt;Second, an unauthenticated monitor sees the public surface and only the public surface. It catches a signed-out break on a page or endpoint you point it at. It does not see authenticated-only subsystems, the logged-in dashboards, the account-scoped APIs, the parts of the product that only exist behind a session. Those need their own instrument. The signed-out monitor is exactly as wide as the anonymous public path and no wider.&lt;/p&gt;

&lt;p&gt;Third, we do not claim to know what broke inside GitHub. The resolution update said a root cause analysis would be shared when available, and at the time of writing none had been published. Anything more specific than GitHub's own words would be guessing, so we quote them and stop there.&lt;/p&gt;

&lt;p&gt;Fourth, the content assertion is only as good as the string you pick. Assert on something stable that the public page genuinely always renders for an anonymous visitor, not on a marketing banner or an A/B-tested element that can legitimately disappear. A brittle assertion produces false reds; a well-chosen one is what separates "the public page rendered" from "something returned 200."&lt;/p&gt;

&lt;h2&gt;
  
  
  This pattern, not just this incident
&lt;/h2&gt;

&lt;p&gt;The signed-out outage is the mirror image of a failure we tore down recently. In &lt;a href="https://velprove.com/blog/github-actions-may-2026-detection-teardown" rel="noopener noreferrer"&gt;the GitHub Actions May 2026 detection teardown&lt;/a&gt;, the authenticated and orchestration surface degraded while the public web surface stayed up, and the right instrument was a content-aware API monitor on an authenticated runs endpoint. Here the asymmetry is reversed: the public, signed-out surface broke while the authenticated experience stayed fine, and the right instrument is an unauthenticated content monitor on a public URL.&lt;/p&gt;

&lt;p&gt;The two read as a pair. Same underlying lesson, opposite sides of the login. In one, a plain probe on the public web stayed green while authenticated orchestration burned; in the other, an authenticated session stayed green while the anonymous public path burned. Both are members of the same family in &lt;a href="https://velprove.com/blog/anatomy-of-a-silent-outage" rel="noopener noreferrer"&gt;our catalog of silent outages&lt;/a&gt;: an internal signal looks healthy while a specific external population is locked out of the truth. The only durable defense is to monitor the exact surface and the exact identity (or absence of identity) that your own checking can't reach. If your team is all signed in, the surface you can't see is the signed-out one.&lt;/p&gt;

&lt;p&gt;This is the generalized lesson for any product with a public surface, not just GitHub. Marketing pages, docs, public dashboards, anonymous API endpoints: any of them can break for the logged-out world while your logged-in team sees green. If that is the shape of your stack, &lt;a href="https://velprove.com/for/saas" rel="noopener noreferrer"&gt;Velprove for SaaS&lt;/a&gt; and &lt;a href="https://velprove.com/for/api" rel="noopener noreferrer"&gt;Velprove for API monitoring&lt;/a&gt; are built around exactly this gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Was GitHub down for everyone in June 2026?
&lt;/h3&gt;

&lt;p&gt;No. GitHub characterized the June 8 2026 incident as limited to unauthenticated, signed-out users accessing Pull Requests, Issues, and Actions. In GitHub's own words at 08:13 UTC, impact was limited to unauthenticated users. Signed-in developers were not characterized as impacted. That is GitHub's characterization of the blast radius, not an independently measured absolute, but it is the published scope.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened during the GitHub signed-out outage on June 8 2026?
&lt;/h3&gt;

&lt;p&gt;The incident ran from 07:11 UTC to 08:36 UTC on June 8 2026, roughly 1 hour 25 minutes. During the window, Pull Requests, Issues, and Actions were degraded for unauthenticated visitors while authenticated sessions rendered normally. GitHub classified the incident as critical. The root cause was not disclosed in the resolution update, which stated a detailed analysis would be shared when available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why could logged-in users still see Pull Requests and Issues?
&lt;/h3&gt;

&lt;p&gt;By GitHub's own scoping, the break was on the anonymous public path: the version of a page served to a visitor with no session cookie. Authenticated requests were not characterized as impacted and rendered normally, which is why a signed-in engineer browsing the same repo saw nothing wrong. The two experiences diverged: green for the logged-in account, broken for the signed-out public.&lt;/p&gt;

&lt;h3&gt;
  
  
  How can you monitor a signed-out outage?
&lt;/h3&gt;

&lt;p&gt;Point an external unauthenticated monitor at a public URL with no session cookie, and assert on content that only the anonymous public path renders, not just a 200 status. A status-only probe can stay green on a fallback or error page that still returns 200. A content assertion on a string the public page is supposed to render turns the monitor red when the signed-out experience breaks, even while a logged-in session stays fine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Did Velprove detect the GitHub outage?
&lt;/h3&gt;

&lt;p&gt;No. Velprove did not monitor GitHub and did not detect this incident. This post uses it as a worked example of the failure shape an external unauthenticated monitor is built to catch, not as a claim that Velprove caught it. The honest framing is the pattern, not a detection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does GitHub have an SLA for this, and would it credit?
&lt;/h3&gt;

&lt;p&gt;GitHub's Free, Pro, and Team plans do not carry a contractual SLA. Even where an SLA applies, whether this incident would count is unclear. On raw arithmetic, roughly 85 minutes exceeds the approximately 43 minutes of downtime that a 99.9 percent monthly target allows in a 30-day month. However, an incident GitHub scoped to signed-out visitors may not meet an SLA's own definition of downtime. The operational angle here is visibility, not the credit. For the deeper SLA-credit math, see &lt;a href="https://velprove.com/blog/sla-vs-slo-vs-sli-customer-guide" rel="noopener noreferrer"&gt;the SLA-vs-SLO-vs-SLI breakdown&lt;/a&gt;.&lt;/p&gt;
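&lt;p&gt;Here is the raw budget arithmetic for a 99.9 percent monthly target, as a short Python sketch. It makes two simplifying assumptions: a 30-day month, and that the whole window counts as downtime. Real SLAs define both the measurement period and what counts as downtime in their own terms, so a partial, signed-out-only degradation may not count at all.&lt;/p&gt;

```python
def allowed_downtime_minutes(target_pct: float, days: int) -> float:
    """Downtime budget for an availability target over a period of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - target_pct) / 100.0


incident_minutes = 85  # 07:11 to 08:36 UTC on June 8 2026

budget = allowed_downtime_minutes(99.9, 30)  # assumption: a 30-day month
print(f"99.9% over 30 days allows {budget:.1f} min of downtime")
print(f"85 min counted as full downtime would exceed it: {incident_minutes > budget}")
```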

&lt;h2&gt;
  
  
  Monitor your public surface
&lt;/h2&gt;

&lt;p&gt;The trap is the blind spot: your whole team is signed in, so the one outage you are structurally unequipped to notice is the one that only hits signed-out visitors. The fix is an external monitor that hits the public URL with no auth cookie and asserts on the content a signed-out visitor is supposed to get, so green means the public page rendered, not just that something returned 200.&lt;/p&gt;

&lt;p&gt;Velprove's free plan covers this: ten monitor slots you can spread across regions, content assertions on HTTP monitors, email alerts, and no credit card. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Sign up free&lt;/a&gt; and point an unauthenticated monitor at your own public surface. If your stack is API-shaped, start with &lt;a href="https://velprove.com/for/api" rel="noopener noreferrer"&gt;Velprove for API monitoring&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>webdev</category>
      <category>devops</category>
      <category>uptime</category>
    </item>
    <item>
      <title>Cloudflare Certificate Failures, June 2026: Let's Encrypt Chains Broke TLS</title>
      <dc:creator>velprove</dc:creator>
      <pubDate>Tue, 09 Jun 2026 14:00:03 +0000</pubDate>
      <link>https://dev.to/velprove/cloudflare-certificate-failures-june-2026-lets-encrypt-chains-broke-tls-4gno</link>
      <guid>https://dev.to/velprove/cloudflare-certificate-failures-june-2026-lets-encrypt-chains-broke-tls-4gno</guid>
      <description>&lt;p&gt;&lt;strong&gt;The short version:&lt;/strong&gt; Cloudflare logged two minor certificate incidents three days apart in June 2026, one on June 2 running about 28.6 hours and one on June 5 running about 10.5 hours, both attributed to unsupported CA bundling on a subset of Let's Encrypt certificates that could fail the TLS handshake for some visitors. The lesson: a site can resolve and look completely valid in your own browser while failing TLS for a subset of clients on a certificate whose expiry date is perfectly healthy. The signal that surfaces that is an external HTTPS monitor that performs a real TLS handshake from outside, the same handshake a visitor's browser makes. That is what Velprove's SSL monitoring does: it reads the served leaf certificate and fails the run when the served certificate will not validate, on the free plan.&lt;/p&gt;

&lt;p&gt;Here is the failure that this teardown is about: a site that resolves, answers at the application layer, and looks completely valid when you open it in your browser, yet fails the TLS handshake for a subset of visitors. The certificate's expiry date is healthy the entire time, so nothing about the date is wrong. The problem is the chain the server bundles and serves to clients. That is exactly the shape Cloudflare published twice in June 2026, and it is a shape a days-remaining number never sees.&lt;/p&gt;

&lt;h2&gt;
  
  
  What broke on Cloudflare in June 2026 (and what didn't)
&lt;/h2&gt;

&lt;p&gt;Cloudflare's own wording is the anchor here, so we quote it rather than paraphrase past it. In both incidents, Cloudflare stated that an unsupported CA bundling issue affected a subset of Let's Encrypt certificates and may have resulted in TLS connectivity issues for some visitors. Read those qualifiers literally. It was a subset, not all certificates. It was some visitors, not everyone. And the verb was may, because the failure was partial and client-dependent.&lt;/p&gt;

&lt;p&gt;The mechanism Cloudflare named is the served certificate chain, not the certificate's issuance. Unsupported CA bundling describes the bundle of certificates a server sends a client so the client can build a trusted path back to a root. When that bundle is wrong, a strict client cannot complete the path and rejects the connection, even though the leaf certificate is genuinely valid. This was not Let's Encrypt failing to issue certificates, and it was not a Cloudflare private-key problem. Issuance worked. Unaffected certificates worked. The application layer behind the certificate was fine. What broke was the chain served to some clients.&lt;/p&gt;

&lt;p&gt;One consequence worth stating once: an interactive browser that hits a broken served chain often shows a warning a human can click through, while an automated client hard-fails the handshake with nobody there to override it. We do not re-teach that mechanic here because it is the subject of a separate post; for the full browser-versus-machine-client breakdown and the chain mechanics behind it, see &lt;a href="https://velprove.com/blog/ssl-renewal-automation-chain-monitoring" rel="noopener noreferrer"&gt;when SSL renewal automation fails silently&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The timeline (UTC, primary-source)
&lt;/h2&gt;

&lt;p&gt;Cloudflare recorded two distinct incidents three days apart, both on the SSL Certificate Provisioning component, both impact minor, both describing the same unsupported CA bundling cause. The updates below are quoted verbatim from the Cloudflare status incident permalinks. All times are UTC. Incident A's official title carries the spelling “Lets Encrypt” without an apostrophe, which we leave as Cloudflare published it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident A:&lt;/strong&gt; June 2 00:25 UTC to June 3 05:01 UTC, roughly 28 hours 36 minutes. Source: &lt;a href="https://www.cloudflarestatus.com/incidents/j17t8xz91xs0" rel="noopener noreferrer"&gt;Cloudflare Status incident j17t8xz91xs0&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time (UTC)&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Update (verbatim)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Jun 2, 00:25&lt;/td&gt;
&lt;td&gt;Identified&lt;/td&gt;
&lt;td&gt;“Cloudflare has identified an issue affecting a subset of Let's Encrypt certificates, in which unsupported CA bundling may result in TLS connectivity issues for some visitors. Customers requiring immediate resolution may order a replacement certificate; re-issuance from the same Certificate Authority will resolve the issue. Customer action is not required for a permanent fix.”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 3, 05:01&lt;/td&gt;
&lt;td&gt;Resolved&lt;/td&gt;
&lt;td&gt;“This incident has been resolved.”&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Incident B:&lt;/strong&gt; June 5 17:22 UTC to June 6 03:53 UTC, roughly 10 hours 31 minutes. Source: &lt;a href="https://www.cloudflarestatus.com/incidents/rsb9bsncwr64" rel="noopener noreferrer"&gt;Cloudflare Status incident rsb9bsncwr64&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time (UTC)&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Update (verbatim)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Jun 5, 17:22&lt;/td&gt;
&lt;td&gt;Investigating&lt;/td&gt;
&lt;td&gt;“Cloudflare has identified an issue affecting a subset of Let's Encrypt certificates, in which unsupported CA bundling may result in TLS connectivity issues for some visitors.”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 5, 17:28&lt;/td&gt;
&lt;td&gt;Identified&lt;/td&gt;
&lt;td&gt;“Cloudflare has identified an issue affecting a subset of Let's Encrypt certificates, in which unsupported CA bundling may result in TLS connectivity issues for some visitors. Customers requiring immediate resolution may order a replacement certificate; re-issuance from the same Certificate Authority will resolve the issue. Customer action is not required for a permanent fix.”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 6, 03:42&lt;/td&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;“A fix has been implemented and we are monitoring the results.”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 6, 03:53&lt;/td&gt;
&lt;td&gt;Resolved&lt;/td&gt;
&lt;td&gt;“This incident has been resolved.”&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
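&lt;p&gt;Both durations fall straight out of the quoted timestamps. The minimal Python sketch below uses only the UTC times from the two tables. Incident B works out to 10 hours 31 minutes, which rounds to the 10.5 hours used elsewhere in this post. It also shows the gap between the two opening updates.&lt;/p&gt;

```python
from datetime import datetime, timezone


def utc(month: int, day: int, hour: int, minute: int) -> datetime:
    """All status-page timestamps here are 2026 and UTC."""
    return datetime(2026, month, day, hour, minute, tzinfo=timezone.utc)


# (first update, resolved) per incident, from the tables above.
incidents = {
    "A": (utc(6, 2, 0, 25), utc(6, 3, 5, 1)),
    "B": (utc(6, 5, 17, 22), utc(6, 6, 3, 53)),
}

durations = {}
for name, (start, end) in incidents.items():
    minutes = int((end - start).total_seconds() // 60)
    durations[name] = minutes
    print(f"Incident {name}: {minutes // 60}h {minutes % 60}m ({minutes / 60:.1f} h)")

# How far apart the two incidents opened.
gap = incidents["B"][0] - incidents["A"][0]
print(f"B opened {gap.days}d {gap.seconds // 3600}h after A")
```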

&lt;p&gt;The two records share identical cause language and identical remediation language. It is tempting to call them one event, but Cloudflare did not. There is no combined post-mortem and no statement that June 5 was a recurrence of June 2. We treat the shared root cause as an inference from the matching wording, not as a fact Cloudflare asserted. What is certain is that the same failure shape was logged twice in four days.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a broken served chain is hard to catch
&lt;/h2&gt;

&lt;p&gt;Start with the date, because the date is the trap. Throughout both windows, the leaf certificate's expiry was healthy. Nothing was about to lapse. So a monitor that counts down days until expiry, the right tool for catching a certificate that is genuinely about to expire, would have read green the entire time and never fired. This is not the expiry failure class; for the days-remaining countdown and the 30-15-7 threshold rule that governs it, see &lt;a href="https://velprove.com/blog/ssl-certificate-expiry-monitoring" rel="noopener noreferrer"&gt;SSL certificate expiry monitoring&lt;/a&gt;. That post owns the countdown. This failure lives where the countdown goes blind.&lt;/p&gt;
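&lt;p&gt;To make the blind spot concrete, here is a minimal Python sketch of what a days-remaining countdown actually computes. It uses the standard library's &lt;code&gt;ssl.cert_time_to_seconds&lt;/code&gt;, which parses the &lt;code&gt;notAfter&lt;/code&gt; string format that &lt;code&gt;getpeercert()&lt;/code&gt; returns. The expiry date and the 30-day threshold are hypothetical values. Nothing in the calculation looks at the served chain, so a mis-bundled chain cannot move it.&lt;/p&gt;

```python
import ssl


def days_remaining(not_after: str, now: float) -> int:
    """Whole days until notAfter, given the string format getpeercert() returns."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return int((expires_at - now) // 86400)


def expiry_alert(days_left: int, threshold_days: int = 30) -> bool:
    """True once the countdown reaches the threshold. It never inspects the served chain."""
    return threshold_days >= days_left


# Hypothetical values: a leaf expiring in late August, checked when incident B opened.
now = ssl.cert_time_to_seconds("Jun  5 17:22:00 2026 GMT")
days = days_remaining("Aug 30 12:00:00 2026 GMT", now)
print(days, "days remaining; expiry alert:", expiry_alert(days))
```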

&lt;p&gt;Then there is the partial shape. Cloudflare said some visitors, not all. A chain that is mis-bundled can still validate for a client that already has the missing piece cached or bundled, and fail for a client that needs the served chain to be complete and correct. So the failure can hide from any single observer whose TLS stack happens to accept the served chain. Your laptop validates it; a partner's stricter client does not. And because the site answers at the application layer, every surface that is not the TLS handshake itself looks up.&lt;/p&gt;
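&lt;p&gt;The client-dependence can be shown with a toy model. This Python sketch is not how real TLS validation works: it has no signatures, no name constraints, and no AIA fetching. It models a missing link in the served bundle rather than Cloudflare's unpublished specifics, and every certificate and client name is hypothetical. It only shows why the same served bundle can pass for one client and fail for another.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str


@dataclass
class Client:
    name: str
    trusted_roots: set
    cached_intermediates: dict = field(default_factory=dict)  # subject -> Cert


def builds_trusted_path(client: Client, leaf: Cert, served_bundle: list, max_depth: int = 5) -> bool:
    """Follow issuer links from the leaf toward a trusted root (toy model, no crypto)."""
    available = {c.subject: c for c in served_bundle}
    available.update(client.cached_intermediates)  # a cached copy can paper over a gap
    current = leaf
    for _ in range(max_depth):
        if current.issuer in client.trusted_roots:
            return True
        nxt = available.get(current.issuer)
        if nxt is None:
            return False  # missing link: this client rejects the handshake
        current = nxt
    return False


# Hypothetical names throughout.
ROOT = "Example Root X1"
intermediate = Cert(subject="Example Intermediate R1", issuer=ROOT)
leaf = Cert(subject="app.example.com", issuer="Example Intermediate R1")

mis_bundled = []  # the server sends the leaf but not the intermediate
laptop = Client("laptop browser", {ROOT}, {intermediate.subject: intermediate})
strict = Client("partner API client", {ROOT})

print(laptop.name, builds_trusted_path(laptop, leaf, mis_bundled))
print(strict.name, builds_trusted_path(strict, leaf, mis_bundled))
```

&lt;p&gt;The laptop succeeds only because it already holds the intermediate. The strict client has nothing to fill the gap with, so the same server looks healthy to one observer and broken to the other.&lt;/p&gt;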

&lt;p&gt;Put those together and you get the most deceptive kind of outage: a valid-looking site, a healthy expiry date, a working application, and a TLS handshake that fails for a subset of the people trying to reach you. The only instrument positioned to see it is one that makes the same handshake a real client makes, from outside.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Velprove monitors TLS/SSL certificates
&lt;/h2&gt;

&lt;p&gt;An external HTTPS monitor performs a real TLS handshake from a vantage point outside your infrastructure, exactly the way a visitor's browser does. The moment a served certificate stops validating, that handshake fails and the monitor flips to Down. That is the detection signal an internal or application-layer view misses, because internally the application is still answering and the certificate file on disk still looks fine. The failure only exists on the wire, between a real client and the served chain, which is precisely where an outside handshake looks.&lt;/p&gt;

&lt;p&gt;Setting this up is a single HTTPS monitor pointed at the hostname you serve. On each run it completes a real TLS handshake, and if that handshake fails because the served certificate does not validate, the monitor records a Down result rather than a quiet green. That detection is the base behavior of an HTTPS monitor; it needs no extra option turned on, because a failed handshake is a failed run. The wizard also offers an SSL certificate expiry alert toggle that adds a days-remaining countdown and reads the served leaf certificate, its expiry date and its issuer, but that countdown is a separate, complementary signal: it watches for a certificate that is about to lapse, which is not what failed here. There is no configuration file to hand-write and no code to maintain.&lt;/p&gt;
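&lt;p&gt;The same outside-in handshake can be sketched with Python's standard &lt;code&gt;ssl&lt;/code&gt; and &lt;code&gt;socket&lt;/code&gt; modules. This illustrates the mechanism; it is not Velprove's code. &lt;code&gt;check_tls&lt;/code&gt; and &lt;code&gt;classify&lt;/code&gt; are hypothetical names, and the validation rules are whatever the local OpenSSL build and system trust store apply. Like the monitor described above, it reads only the served leaf's &lt;code&gt;notAfter&lt;/code&gt; and issuer, and any failure to complete the handshake is a failed run.&lt;/p&gt;

```python
import socket
import ssl


def classify(exc: Exception) -> str:
    """Map a failed connection or handshake to a Down result with a readable reason."""
    if isinstance(exc, ssl.SSLCertVerificationError):
        detail = getattr(exc, "verify_message", None) or "verification failed"
        return "down: certificate did not validate (" + detail + ")"
    if isinstance(exc, ssl.SSLError):
        return "down: TLS handshake error (" + str(exc) + ")"
    if isinstance(exc, OSError):
        return "down: connection error (" + str(exc) + ")"
    return "down: " + repr(exc)


def check_tls(host: str, port: int = 443, timeout: float = 10.0) -> dict:
    """Complete a real TLS handshake the way a default client would."""
    context = ssl.create_default_context()  # system trust store plus hostname checking
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()  # the served leaf only, parsed after validation
                issuer = dict(rdn[0] for rdn in cert.get("issuer", ()))
                return {"result": "up", "not_after": cert.get("notAfter"), "issuer": issuer}
    except (OSError, ValueError) as exc:  # ssl errors are OSError subclasses
        return {"result": classify(exc)}
```

&lt;p&gt;On a healthy host, &lt;code&gt;check_tls("example.com")&lt;/code&gt; returns an &lt;code&gt;"up"&lt;/code&gt; result with the leaf's expiry and issuer. A badssl.com host such as &lt;code&gt;self-signed.badssl.com&lt;/code&gt;, assuming it is still online, returns a certificate-verification failure. Python's &lt;code&gt;ssl&lt;/code&gt; module does not fetch missing intermediates. As a result, a host like &lt;code&gt;incomplete-chain.badssl.com&lt;/code&gt; may fail here while some browsers load it, which is the client-dependence described in this post in miniature.&lt;/p&gt;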

&lt;p&gt;To see the symptom concretely, point a monitor at a host that deliberately serves a certificate a standard client will not trust, such as one of the badssl.com test hosts. The handshake fails closed, and the monitor flips to Down with a connection error, the same error a real client would hit. That is the alert that would have surfaced a served-chain failure from outside, with no chain expertise required to read it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honesty boundary
&lt;/h2&gt;

&lt;p&gt;The strongest version of this post is the one that names what it does not claim. Four limits matter here.&lt;/p&gt;

&lt;p&gt;First, the boundary on what Velprove reads. Velprove's SSL monitoring reads the served leaf certificate, its expiry and issuer, and the HTTPS handshake fails when the served certificate is invalid. It does not introspect the full chain and it does not diagnose the CA-side cause. So it tells you that TLS is failing for clients. It does not tell you that the reason is unsupported CA bundling. For walking and grading the served chain on a specific host, use a dedicated chain inspector like &lt;a href="https://www.ssllabs.com/ssltest/" rel="noopener noreferrer"&gt;SSL Labs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Second, detection here is vantage- and client-dependent. Because the failure hit a subset of clients, whether a given probe trips depends on whether that probe's TLS client needs the broken part of the served chain. Catching a partial-subset failure from one vantage is a realistic possibility, not a guarantee. This is the same hard-fail shape that bites machine-to-machine clients with no human to click through, including pinned mobile clients, covered in &lt;a href="https://velprove.com/blog/monitor-mobile-app-backend-api" rel="noopener noreferrer"&gt;monitoring a mobile app backend&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Third, and most important: Velprove did not detect this specific Cloudflare incident. We have no such observation. This is an illustrative teardown of the failure class and the detection mechanism, not a claim that any Velprove monitor caught the June 2026 events.&lt;/p&gt;

&lt;p&gt;Fourth, detection is not prevention. An external handshake surfaces a served-chain failure faster than a human spot-check or a green expiry badge will. It does not stop the failure from happening. The value is knowing, early, that real clients cannot connect.&lt;/p&gt;

&lt;h2&gt;
  
  
  This pattern, not just this incident
&lt;/h2&gt;

&lt;p&gt;A valid-looking site that fails TLS for a subset of clients is a recurring silent-outage shape, not a one-off. Cloudflare logging the same shape twice in four days is itself the point: this is a pattern with a name and a detection instrument, not a freak event. The broader taxonomy of outages that answer green to a naive check lives in &lt;a href="https://velprove.com/blog/anatomy-of-a-silent-outage" rel="noopener noreferrer"&gt;the anatomy of a silent outage&lt;/a&gt;, and a sibling dated teardown of a different vendor incident, where the response body told a truth the status code hid, is &lt;a href="https://velprove.com/blog/github-actions-may-2026-detection-teardown" rel="noopener noreferrer"&gt;the GitHub Actions May 2026 detection teardown&lt;/a&gt;. Same lesson, different surface: what a real client experiences is the only thing worth asserting on.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to monitor for TLS and SSL
&lt;/h2&gt;

&lt;p&gt;The recommendation is small and concrete. Run an HTTPS monitor against each hostname you actually serve, on the free plan, from one region you pick out of the five available. The HTTPS monitor completes a real TLS handshake from outside on every run by default, so a served certificate that stops validating surfaces as a Down result instead of hiding behind a healthy expiry date and a working application. Add the SSL certificate expiry alert toggle if you also want a days-remaining countdown for the separate case of a certificate about to lapse.&lt;/p&gt;

&lt;p&gt;Keep the boundary in mind as you do it: this reads the served leaf and fails the handshake on an invalid served certificate; it does not diagnose the chain, so pair it with SSL Labs when you need to know why a chain is malformed. Detection from outside is what turns a partial, valid-looking TLS failure into an alert you can act on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Set up a free HTTPS monitor with Velprove&lt;/a&gt; . A real TLS handshake from the region you choose, no credit card required, commercial use allowed on every plan including free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What happened in the Cloudflare certificate outage in June 2026?
&lt;/h3&gt;

&lt;p&gt;Cloudflare logged two minor incidents on its SSL Certificate Provisioning component: one on June 2 2026 lasting roughly 28.6 hours, and one on June 5 2026 lasting roughly 10.5 hours. In both, Cloudflare's published cause was an unsupported CA bundling issue affecting a subset of Let's Encrypt certificates, which may have caused TLS connectivity issues for some visitors. This was a certificate serving and chain-bundling problem, not a certificate issuance failure. Cloudflare stated that customer action was not required for a permanent fix. Customers who needed immediate resolution could order a replacement certificate, since re-issuance from the same Certificate Authority would resolve the issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Was the whole Cloudflare network down in June 2026?
&lt;/h3&gt;

&lt;p&gt;No. Cloudflare labeled both incidents minor, and the stated scope was a subset of Let's Encrypt certificates and some visitors. This was not a network-wide outage. Most traffic and most certificates were unaffected. The failure was partial and client-dependent, which is part of why it is hard to notice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Was the Cloudflare cert issue Let's Encrypt's fault or Cloudflare's?
&lt;/h3&gt;

&lt;p&gt;Cloudflare's published cause was an unsupported CA bundling problem on the served chain, which is a serving and chain-construction issue, not a Let's Encrypt issuance failure. Cloudflare did not attribute the incidents to a Let's Encrypt-side outage, and neither do we. We also do not assert a single confirmed root cause shared by the two windows. The wording in both records is identical, but Cloudflare published no combined post-mortem, so a shared root cause is an inference, not a stated fact.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can a monitor catch a certificate that fails for only some clients?
&lt;/h3&gt;

&lt;p&gt;An external HTTPS monitor performs a real TLS handshake from outside and flips to Down when the served certificate fails to validate for its probe. For this failure class, that is a realistic catch. But on a partial-subset failure the detection is vantage- and client-dependent: whether a given probe trips depends on whether that probe's TLS client needs the broken part of the served chain. So it is a realistic possibility, not a guarantee for every such incident.&lt;/p&gt;

&lt;h3&gt;
  
  
  Did Velprove detect the Cloudflare cert issue in June 2026?
&lt;/h3&gt;

&lt;p&gt;No. We have no such observation. This post is an illustrative teardown of the failure class and the detection mechanism an external TLS handshake provides. It is not a claim that Velprove caught this specific Cloudflare incident.&lt;/p&gt;

&lt;h3&gt;
  
  
  What can Velprove's SSL monitoring tell me, and what can't it?
&lt;/h3&gt;

&lt;p&gt;Velprove's SSL monitoring reads the served leaf certificate, meaning its expiry date and issuer, and the HTTPS handshake fails when the served certificate is invalid. So it catches that TLS is failing for clients. It does not introspect the full chain or diagnose the CA-side cause such as unsupported CA bundling. For walking and grading the served chain, use a dedicated chain inspector like &lt;a href="https://www.ssllabs.com/ssltest/" rel="noopener noreferrer"&gt;SSL Labs&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>webdev</category>
      <category>devops</category>
      <category>uptime</category>
    </item>
    <item>
      <title>Monitor a Cloudflare Tunnel or Access-Protected App</title>
      <dc:creator>velprove</dc:creator>
      <pubDate>Mon, 08 Jun 2026 14:00:03 +0000</pubDate>
      <link>https://dev.to/velprove/monitor-a-cloudflare-tunnel-or-access-protected-app-3m3f</link>
      <guid>https://dev.to/velprove/monitor-a-cloudflare-tunnel-or-access-protected-app-3m3f</guid>
      <description>&lt;p&gt;&lt;strong&gt;The honest version:&lt;/strong&gt; To monitor an app behind Cloudflare Access you have to authenticate the probe, because a naive status-only monitor pointed at the hostname goes green on the Access login page, not on your app. The monitor typically follows a 302 redirect into the login flow and lands on a login interstitial that returns 200, so your check reports up while never touching the real app behind the gate. Cloudflare Tunnel and Cloudflare Access are two different layers (Tunnel is the connector that gives a no-public-IP origin a hostname; Access is the Zero Trust auth gate in front of it), and each fails in its own way. The fix for the Access side is an external probe that authenticates with a service token: add an Access policy with action Service Auth, send the &lt;code&gt;CF-Access-Client-Id&lt;/code&gt; and &lt;code&gt;CF-Access-Client-Secret&lt;/code&gt; headers, and assert on post-gate content. That is exactly what Velprove does with a free API monitor that sends custom headers, plus a no-code browser login monitor for the human sign-in path. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Start free&lt;/a&gt;, no credit card required.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The check is green and nobody can reach your app
&lt;/h2&gt;

&lt;p&gt;A status-only monitor goes green on a Cloudflare Access app because it measures Cloudflare's login page, not your app behind the gate. Here is the failure that sends people looking for this post. You put a basic uptime check on &lt;code&gt;https://app.example.com&lt;/code&gt;, an app sitting behind Cloudflare Access. The check goes green and stays green. Then a colleague pings you because the app has been throwing errors for an hour. The monitor never noticed, because the monitor was never looking at your app.&lt;/p&gt;

&lt;p&gt;When an unauthenticated client hits an Access-protected hostname, Cloudflare Access does not hand it the app. It sends the client into the login flow instead. For a default self-hosted Access application, that is typically a 302 redirect toward the Access and identity provider login. A status-only monitor that follows redirects walks right into the login interstitial, which itself returns a clean HTTP 200. So an "is it 200?" check reports green. It has measured that Cloudflare's login page is up, which it almost always is, and it has learned nothing about the app behind the gate.&lt;/p&gt;

&lt;p&gt;The other version of the same bug is a false red. A stricter check that does not follow redirects, or that demands a specific status, sees the 302 and trips. Now you get paged for an app that is perfectly healthy. In neither case is the monitor evaluating your app. It is arguing with the Access gate.&lt;/p&gt;

&lt;p&gt;One nuance worth stating, because it changes what you assert on. Apps that use Cloudflare's newer Managed OAuth may not return a 302 at all. A non-browser client can instead receive a 401 response with a &lt;code&gt;WWW-Authenticate&lt;/code&gt; header that points at the OAuth discovery endpoints. So you cannot hard-code "Access always returns code X." The response depends on how the application is configured. What is constant is the lesson: a status code alone, from an unauthenticated probe, tells you about the gate, not the app.&lt;/p&gt;

&lt;p&gt;This is the same structural problem behind every silent outage. An internal signal looks green while the external reality is broken. We have written about &lt;a href="https://velprove.com/blog/why-uptime-monitors-miss-outages" rel="noopener noreferrer"&gt;why uptime monitors miss real outages&lt;/a&gt; in general. The Cloudflare Access login page is one of the cleanest specific instances of it, because the green you get is not even your server. It is Cloudflare's.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Tunnel down vs origin down vs gate misconfigured: three different failures
&lt;/h2&gt;

&lt;p&gt;Before the fix, get the failure taxonomy straight, because an external probe has to tell these three apart, and Cloudflare serves a different page for each.&lt;/p&gt;

&lt;p&gt;First, some context on the Tunnel itself. A Cloudflare Tunnel app has no publicly routable IP and no open inbound ports. The &lt;code&gt;cloudflared&lt;/code&gt; daemon makes outbound-only connections to Cloudflare's network, and you can block all inbound traffic to the origin so it is reachable only through Cloudflare. A monitor cannot hit the origin directly. It must go through the Cloudflare hostname. That is the whole point of a tunnel, and it is also why "just curl the origin" is not an option here.&lt;/p&gt;

&lt;p&gt;Setting up &lt;code&gt;cloudflared&lt;/code&gt; to give a NAT'd box a public hostname in the first place is covered in &lt;a href="https://velprove.com/blog/monitor-self-hosted-app-stack" rel="noopener noreferrer"&gt;the self-hosted stack guide&lt;/a&gt;. This post assumes the tunnel already exists and focuses on monitoring through it and through Access. With that out of the way, here are the three failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error 1033, "Cloudflare Tunnel error."&lt;/strong&gt; The tunnel itself is not connected. Cloudflare cannot find a healthy &lt;code&gt;cloudflared&lt;/code&gt; instance to receive the traffic, usually because the connector process has stopped. Cloudflare serves its own 1033 error page from the edge. Your app is not involved at all. A monitor that only checks "did I get a response?" does get one, a complete Cloudflare error page, and can record that as up. This is why content assertions matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;502 Bad Gateway.&lt;/strong&gt; This is a different failure. The tunnel IS connected to Cloudflare, but &lt;code&gt;cloudflared&lt;/code&gt; cannot reach the origin service defined in your ingress rule. The service may be down or not responding to traffic from &lt;code&gt;cloudflared&lt;/code&gt;. The connector is healthy. The thing behind it is not. Do not conflate this with 1033. One means the tunnel is down; the other means the origin behind a working tunnel is down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Access gate misconfigured or service token rejected.&lt;/strong&gt; You reach Cloudflare and you reach the gate, but you never reach the app, because the policy does not admit you. This is the failure the rest of the post is about, and the one a status-only check most often hides as a false green. It is covered in section four.&lt;/p&gt;

&lt;p&gt;The takeaway: an external monitor that asserts on a real string from your app distinguishes all three. A 1033 page, a 502 page, and an Access login page are all Cloudflare-edge-served responses. None of them contain the post-gate content your app renders. Assert on that content and any of the three turns the check red, for the right reason.&lt;/p&gt;
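&lt;p&gt;Here is a minimal Python sketch of that reasoning. It maps one probe result to one of the three failures, so a red check also says why it is red. The marker string, the labels, and the &lt;code&gt;Error 1033&lt;/code&gt; text match are assumptions for illustration. Check what your own Cloudflare error pages actually contain before relying on the text match. Whether you see the Access 302 at all depends on whether your probe follows redirects.&lt;/p&gt;

```python
# Sketch: classify one probe result. The post-gate marker is the primary signal;
# the other branches are hypothetical hints for triage, not Cloudflare guarantees.

def classify(status: int, body: str, post_gate_marker: str) -> str:
    """Return a label for a single probe of an Access/Tunnel-fronted app."""
    if status == 200 and post_gate_marker in body:
        return "ok"  # reached the real app behind the gate
    if "Error 1033" in body:
        return "tunnel-down"  # edge could not find a healthy cloudflared
    if status == 502:
        return "origin-down"  # tunnel connected, origin behind it unreachable
    if status in (302, 401):
        return "access-gate"  # redirect to login, or Managed OAuth challenge
    if status == 200:
        # 200 without the marker: e.g. the Access login page (the false green)
        return "not-the-app"
    return "unknown"
```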

&lt;h2&gt;
  
  
  3. Why Cloudflare's own Tunnel-health and Access dashboards are not enough
&lt;/h2&gt;

&lt;p&gt;Cloudflare ships real first-party signals here, and an honest post says so. Tunnel monitoring reports connector health as Healthy, Degraded, or Down, and you can wire notifications to tunnel status. The Access dashboard confirms an application exists and that a policy is attached to it. Both are genuinely useful for what they do.&lt;/p&gt;

&lt;p&gt;The gap is what they do not do. Connector-healthy tells you &lt;code&gt;cloudflared&lt;/code&gt; is connected to Cloudflare. It does not tell you the origin behind the connector answers, and it cannot tell you a real user gets through the Access gate to a working app. Policy-exists tells you a rule is configured. It does not tell you the rule admits the right callers, or that the app returns content rather than a 500 once they are through. Connector up plus policy exists is not the same claim as end-to-end reachable.&lt;/p&gt;

&lt;p&gt;There is also a vantage-point problem. Cloudflare's own signals observe from inside Cloudflare's network. An external probe runs from outside it, which is the only way to witness the full path a user actually traverses: client to Cloudflare edge, through the Access gate, through the tunnel, to your origin, and back. Use Cloudflare's observability for debugging connector state and policy configuration. Use an external monitor for the ground-truth question of whether the app responds end to end.&lt;/p&gt;

&lt;p&gt;One disambiguation, because Cloudflare is a large surface. This post is about apps reached &lt;em&gt;through&lt;/em&gt; Cloudflare Tunnel and Cloudflare Access, which is Cloudflare One, the Zero Trust side. If instead you run your app &lt;em&gt;on&lt;/em&gt; Cloudflare's developer platform (Workers, Pages, KV, R2, D1, Durable Objects), the subsystem probes for that live in &lt;a href="https://velprove.com/blog/monitor-cloudflare-workers-pages-site" rel="noopener noreferrer"&gt;the Workers, Pages, KV, R2, and D1 monitoring guide&lt;/a&gt;. Different product family, different failure story. One link is enough; the rest of this post stays on Tunnel and Access.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. The fix: monitor an app behind Cloudflare Access with a service token
&lt;/h2&gt;

&lt;p&gt;The recommended path is a service token. A service token is a Client ID and Client Secret pair that a machine client (your monitor) presents to satisfy an Access policy without an interactive identity provider login. Critically, this does not bypass Access. It satisfies an Access policy. Access still evaluates the policy on every request. You are authenticating, not skipping the gate.&lt;/p&gt;

&lt;p&gt;Here is the self-contained recipe for monitoring a Cloudflare Access-protected app:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an Access &lt;strong&gt;service token&lt;/strong&gt; in Cloudflare Zero Trust. You get a Client ID and a Client Secret.&lt;/li&gt;
&lt;li&gt;On the Access application, add a policy with action &lt;strong&gt;Service Auth&lt;/strong&gt; that admits that token. If you skip the Service Auth action, Access will prompt for an identity provider login and the token will not get you in.&lt;/li&gt;
&lt;li&gt;Configure your monitor to send two request headers: &lt;code&gt;CF-Access-Client-Id&lt;/code&gt; and &lt;code&gt;CF-Access-Client-Secret&lt;/code&gt;. If the app has only Service Auth policies, the headers must be sent on every request. That works fine for a stateless monitor that hits the URL on each run.&lt;/li&gt;
&lt;li&gt;Assert on &lt;strong&gt;post-gate content&lt;/strong&gt;: a string that only your real app renders, never the login page.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now a green check means the monitor passed Access and reached the app, not that it reached a login interstitial.&lt;/p&gt;
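&lt;p&gt;Outside any monitoring product, the same recipe looks roughly like this with Python's standard library. It is a sketch, not Velprove's implementation. The environment variable names and the marker string are hypothetical placeholders.&lt;/p&gt;

```python
import os
import urllib.request

# Hypothetical env var names: keep the token out of source control.
CLIENT_ID_VAR = "CF_ACCESS_CLIENT_ID"
CLIENT_SECRET_VAR = "CF_ACCESS_CLIENT_SECRET"


def build_access_request(url: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build a GET request that presents an Access service token on this call."""
    return urllib.request.Request(
        url,
        headers={
            "CF-Access-Client-Id": client_id,
            "CF-Access-Client-Secret": client_secret,
        },
        method="GET",
    )


def probe(url: str, post_gate_marker: str, timeout: float = 10.0) -> bool:
    """True only if the app answered 200 AND rendered the post-gate marker."""
    req = build_access_request(
        url, os.environ[CLIENT_ID_VAR], os.environ[CLIENT_SECRET_VAR]
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and post_gate_marker in body
    except OSError:
        # HTTPError, URLError and timeouts are all OSError subclasses
        return False
```

&lt;p&gt;&lt;code&gt;urlopen&lt;/code&gt; follows redirects. If the token is rejected, the probe can end on the Access login page with a 200. The marker check is what turns that case into a failure instead of a false green.&lt;/p&gt;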

&lt;p&gt;A few honest caveats that come straight from how this works. The token must be permitted on that exact application. A token valid for app A does not authenticate app B; Access scopes it to the application whose policy admits it. And service tokens expire. Cloudflare's documentation uses an example lifetime of about a year, configurable, and the token needs rotation before it lapses. A forgotten rotation turns a previously-green monitor into an authentication failure. That is not the monitor breaking. That is the monitor correctly surfacing an expired credential. Cloudflare can alert you ahead of expiry; rotate the token, update the two headers in the monitor, and you are green again. Treating token expiry as a monitorable event is part of doing this properly, not a flaw in the approach.&lt;/p&gt;
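&lt;p&gt;If you record a token's expiry date when you create it, a few lines can flag rotation before the monitor goes red. This is a sketch. The 30-day warning window is an arbitrary assumption, so pick one that fits your rotation process.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def days_left(expires_at: datetime, now: datetime) -> int:
    """Whole days until the service token expires (negative once lapsed)."""
    return (expires_at - now).days


def should_rotate(expires_at: datetime, now: Optional[datetime] = None,
                  warn_days: int = 30) -> bool:
    """True once the token is inside the warning window, or already expired."""
    now = now or datetime.now(timezone.utc)
    return timedelta(days=warn_days) >= expires_at - now
```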

&lt;p&gt;This service-token-plus-custom-headers pattern is exactly the shape an API team needs. If you are here to &lt;a href="https://velprove.com/for/api" rel="noopener noreferrer"&gt;monitor an API behind Cloudflare Access&lt;/a&gt;, the same two headers and a post-gate body assertion are the whole play.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The lesser option: a scoped Bypass on a health path (and why not to default to it)
&lt;/h2&gt;

&lt;p&gt;If you search this problem, the most common community answer is "just expose an unauthenticated URL so the monitor can reach it." You can do a scoped version of that with an Access &lt;strong&gt;Bypass&lt;/strong&gt; policy on a dedicated &lt;code&gt;/healthz&lt;/code&gt;-style path, so an unauthenticated monitor reaches that one path. But it is the weaker answer, and here is why.&lt;/p&gt;

&lt;p&gt;The Bypass action disables Access enforcement for the matching traffic, and those requests are not logged. That means two real losses on that path: you remove the Zero Trust protection, and you remove the audit trail. For a Zero Trust deployment, punching an unauthenticated, unlogged hole in your own gate to make monitoring easier is exactly the posture you were trying to avoid. The service token is strictly better: Access keeps enforcing the policy, the request is still evaluated and logged, and the monitor still gets through.&lt;/p&gt;

&lt;p&gt;So lead with the service token. Use Bypass only if you have a concrete reason you cannot use a token, and if you do, scope it to a minimal, non-sensitive path that returns no protected data, never to the whole app. Do not disable Access on the application and do not remove the identity provider requirement to "simplify" monitoring. The point of this post is that you do not have to.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Set it up in Velprove: an API monitor with the service-token headers, and the login monitor
&lt;/h2&gt;

&lt;p&gt;Here is the concrete setup against your own Access-protected app. Two monitors cover the two paths a real user takes: a machine path (service token) and a human path (interactive login).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. API monitor on the protected URL, sending the two service-token headers.&lt;/strong&gt; Add an API monitor on the Access-protected hostname, or on a specific post-gate path. Set two request headers, &lt;code&gt;CF-Access-Client-Id&lt;/code&gt; and &lt;code&gt;CF-Access-Client-Secret&lt;/code&gt;, to your service token's Client ID and Client Secret. Then add two Success Conditions on verification: a 200 status, and a Response Body Contains assertion on a post-gate string the real app always renders. The body assertion is what makes a pass mean "reached the app," not "reached the login page." Without it, you are back to the false green from section one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Browser login monitor for the human path.&lt;/strong&gt; A service token covers the machine path. It does not prove a real person can sign in. For that, the no-code browser login monitor opens a real browser, loads the Access and identity provider login, fills a dedicated low-privilege test user's credentials, and asserts on a string that only renders after a successful login. It authenticates the way a human does, not with a token. One honest scope note: a form-fill login monitor handles a standard email-and-password login well, but external identity providers vary. For the nuances of OAuth, SSO, or passkey logins, see &lt;a href="https://velprove.com/blog/monitor-login-that-isnt-email-password" rel="noopener noreferrer"&gt;the guide on monitoring logins that are not email and password&lt;/a&gt; so you do not over-rely on a form fill against an IdP that does something else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Pick a region and wire alerts.&lt;/strong&gt; Each monitor runs from one region you choose. The free plan gives you ten monitor slots to spread across regions, one no-code browser login monitor, and email alerts. A useful trick is to add a second API monitor with the same URL, headers, and assertions in a different region. That lets you catch a regional Tunnel or edge divergence that a single vantage point would miss. If the service-token check then fails in only one region, you have isolated a regional problem rather than a global one.&lt;/p&gt;
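&lt;p&gt;Reading two or more regions' results together takes only simple logic. The sketch below uses made-up region names. It shows how a pass in one region and a fail in another separates a regional problem from a global one.&lt;/p&gt;

```python
def diagnose(results: dict[str, bool]) -> str:
    """results maps a (hypothetical) region name to whether its check passed."""
    failing = sorted(region for region, ok in results.items() if not ok)
    if not failing:
        return "healthy"
    if len(failing) == len(results):
        return "global failure"  # every vantage point sees it
    return "regional failure: " + ", ".join(failing)
```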

&lt;p&gt;Use a low-privilege, monitor-only test account for the browser login, never a real admin user, and never a broadly scoped credential. The service token should be permitted on exactly the application you are monitoring and nothing else. The whole point was to keep Zero Trust intact while still proving the app is reachable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why does my uptime check go green when my Cloudflare Access app is actually unreachable?
&lt;/h3&gt;

&lt;p&gt;Because a status-only check follows Cloudflare Access into the login flow and lands on the Access login page, which itself returns 200. Your monitor reports green having never touched the real app behind the gate. Default self-hosted Access typically returns a 302 redirect to the login, and the login page returns 200. Apps using Cloudflare's newer Managed OAuth may instead return a 401 with a &lt;code&gt;WWW-Authenticate&lt;/code&gt; header. The fix is to authenticate the probe with an Access service token and assert on content that only renders after the gate.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I monitor an app behind Cloudflare Access without disabling Access?
&lt;/h3&gt;

&lt;p&gt;Create an Access service token, add a policy with action Service Auth on the application, and have an API monitor send the &lt;code&gt;CF-Access-Client-Id&lt;/code&gt; and &lt;code&gt;CF-Access-Client-Secret&lt;/code&gt; headers on every request. The token satisfies the Service Auth policy, so the request passes the gate and reaches your app while Access still enforces the policy. You do not disable or weaken Access. The monitor simply authenticates the way a permitted machine client does. Velprove does this on a free API monitor that sends custom request headers, paired with a free no-code browser login monitor for the human sign-in path.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between Error 1033 and a 502 on a Cloudflare Tunnel app?
&lt;/h3&gt;

&lt;p&gt;Error 1033, "Cloudflare Tunnel error," means the tunnel itself is not connected because Cloudflare cannot find a healthy &lt;code&gt;cloudflared&lt;/code&gt; instance, usually because the connector process has stopped. A 502 Bad Gateway means the tunnel is connected to Cloudflare but &lt;code&gt;cloudflared&lt;/code&gt; cannot reach the origin service defined in your ingress rule. They are two different failures: one is the tunnel being down, the other is the origin behind a working tunnel being down. An external monitor that asserts on real app content tells them apart from the Cloudflare error pages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I just expose an unauthenticated health URL so my monitor can reach the app?
&lt;/h3&gt;

&lt;p&gt;You can, with a scoped Access Bypass policy on a dedicated health path, but it is the weaker option. Bypass disables Access enforcement for that traffic and those requests are not logged, so you lose protection and the audit trail on that path. A service token is the better answer: it keeps Access enforcing the policy and still lets the monitor through. If you do use Bypass, scope it to a minimal, non-sensitive path that returns no protected data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do Cloudflare's Tunnel-health and Access dashboards already tell me my app is up?
&lt;/h3&gt;

&lt;p&gt;They tell you the connector is healthy and that an Access policy exists, which is useful, but neither confirms that a real user gets through the tunnel and past the Access gate to a working app. The connector can be up while the origin behind it is down, and the policy can exist while the app returns errors. An external probe that authenticates through Access and asserts on post-gate content is what proves the end-to-end path, from outside Cloudflare's network.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do Cloudflare Access service tokens expire, and will that break my monitor?
&lt;/h3&gt;

&lt;p&gt;Yes. Service tokens have a configurable lifetime (Cloudflare's documentation uses an example of about a year) and must be rotated. A forgotten rotation turns a previously-green monitor into an authentication failure, which is actually the monitor doing its job by surfacing the expiry. Cloudflare can alert before a token expires. Rotate the token, update the headers in your monitor, and the check returns to green.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitor your Cloudflare Access app free
&lt;/h2&gt;

&lt;p&gt;The trap is the false green: a status-only monitor that measures Cloudflare's login page and calls your app healthy. The fix is an external probe that authenticates through the gate. Create an Access service token, add a Service Auth policy, send the &lt;code&gt;CF-Access-Client-Id&lt;/code&gt; and &lt;code&gt;CF-Access-Client-Secret&lt;/code&gt; headers, and assert on post-gate content, so green means reached-the-app. Add a no-code browser login monitor for the human sign-in path, and pick a region per monitor to catch regional divergence.&lt;/p&gt;

&lt;p&gt;Velprove's free plan covers this: ten monitor slots you can spread across regions, custom headers on API monitors for the service token, one no-code browser login monitor, email alerts, and no credit card. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Sign up free&lt;/a&gt; and point a monitor through your own Cloudflare Access gate.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>webdev</category>
      <category>devops</category>
      <category>uptime</category>
    </item>
    <item>
      <title>Monitor Your Headless CMS Publish Path (Contentful, Sanity, Strapi)</title>
      <dc:creator>velprove</dc:creator>
      <pubDate>Sat, 06 Jun 2026 14:00:03 +0000</pubDate>
      <link>https://dev.to/velprove/monitor-your-headless-cms-publish-path-contentful-sanity-strapi-2plh</link>
      <guid>https://dev.to/velprove/monitor-your-headless-cms-publish-path-contentful-sanity-strapi-2plh</guid>
      <description>&lt;p&gt;&lt;strong&gt;Big picture:&lt;/strong&gt; Your headless CMS confirms that it published an entry. It never confirms that the build ran, the deploy succeeded, the CDN purged, and the content is actually live on your production URL. Contentful, Sanity, and Strapi all stop at the publish and, at best, log a webhook delivery attempt. None of them proactively tells you the downstream build failed. The only thing that proves the whole chain is an external prober that hits the live URL and asserts on a per-publish content canary. Velprove is that free prober: a monitor runs from any one of 5 regions, so you create one canary monitor per region and watch all five, with commercial use allowed. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Start for free.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Start with a real one. In Strapi issue &lt;a href="https://github.com/strapi/strapi/issues/22884" rel="noopener noreferrer"&gt;#22884&lt;/a&gt;, opened February 13, 2025 against Strapi 5.7.0, a developer updates a related component and hits Publish. The CMS UI shows the update correctly. The published API does not. The component quietly disappears from the published response until you unpublish the page and republish it. The CMS told the truth as it understood it: it published. What it served on the API was something else, and nothing in the CMS raised a hand to say so. That gap, between what the CMS reports and what your live surface actually serves, is the whole subject of this post.&lt;/p&gt;

&lt;p&gt;This is not theoretical, and it is not unique to Strapi. Status trackers recorded Sanity webhook-delivery incidents on August 29 and August 31, 2024 (per StatusGator's Sanity Webhooks history, which is a status-tracker source rather than a primary Sanity postmortem, so take it as a directional signal, not a precise root cause). The point is general: every headless CMS sits at the start of a multi-step publish path, and the CMS can only see the first step. The rest of the chain is invisible to it, which means it is invisible to you unless you probe the end of the chain from outside.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your CMS confirms it published. It does not confirm it is live.
&lt;/h2&gt;

&lt;p&gt;Here is the honest scope of what a headless CMS knows. When you hit Publish, the CMS writes the entry to its published dataset and returns success. If you wired up a webhook to a build or deploy hook, the CMS will attempt to fire that webhook and record the attempt. That is the edge of its knowledge. It does not run your build. It does not watch your deploy. It does not fetch your production URL afterward to check that the page changed. Its job ends at the publish and the delivery attempt.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the CMS will and will not tell you
&lt;/h3&gt;

&lt;p&gt;The CMS will confirm the publish succeeded. It will, at best, log that a webhook delivery was attempted and whether the receiving server returned a success status. It will not proactively tell you the build that webhook triggered failed, because it never sees the build. And the delivery log only records that the webhook was delivered, never whether the new content went live on your site. A webhook can be delivered perfectly, your build can still fail on a type error, and your CMS log shows a clean green delivery the entire time. The handoff it records and the outcome you care about are two different things.&lt;/p&gt;

&lt;h2&gt;
  
  
  The publish path is a chain, and you can only see the last link from outside
&lt;/h2&gt;

&lt;p&gt;Trace what actually happens after you click Publish. The entry is saved to the published dataset. A webhook fires to a build or deploy hook. A build runs. The build artifact deploys. A CDN purge or revalidation clears the old copy. Then, and only then, the new content is live at your production URL. That is six handoffs: publish, webhook, build, deploy, CDN purge, live.&lt;/p&gt;

&lt;p&gt;Each handoff can fail silently. The webhook can time out and never retry. The build can fail on a broken import or a flaky dependency. The deploy can succeed but ship the wrong artifact. The CDN purge can miss an edge node and keep serving the old page. Every one of those failures leaves the CMS reporting a clean publish, because the CMS is upstream of all of them.&lt;/p&gt;

&lt;p&gt;You cannot watch every link from where you sit. What you can do is probe the last link. If the content is live at the production URL, the entire chain worked, by definition. If it is not, something in the chain broke, and you do not need to know which link to know you have a problem. An external probe of the live URL is the single probe that proves the whole pipeline, because the live URL is the only place where all six handoffs have either succeeded or failed together.&lt;/p&gt;

&lt;h2&gt;
  
  
  The content-canary pattern
&lt;/h2&gt;

&lt;p&gt;This is the wedge, so it is worth being precise. A content canary is a per-publish value that your build emits into the rendered page, and that a monitor asserts on from outside. The value changes on every build. A good canary is a &lt;code&gt;data-publish-id&lt;/code&gt; attribute on the body, a meta tag carrying a build identifier, or a small token returned from a canary endpoint you control, such as a &lt;code&gt;/healthz&lt;/code&gt; or &lt;code&gt;/canary&lt;/code&gt; route that reports the currently-deployed content version.&lt;/p&gt;

&lt;p&gt;It is easy to confuse this with two neighbors, so here is the distinction. A &lt;a href="https://velprove.com/blog/monitor-landing-page-content" rel="noopener noreferrer"&gt;static-string assertion&lt;/a&gt; checks that a fixed phrase like your headline is present. That proves the page renders, but a stale cached copy contains the same headline, so a static string cannot tell fresh from stale. A build-SHA assertion ties the page to the exact commit that built it, which is great for catching a rolled-back deploy. A content canary sits in between and is aimed squarely at the publish path: it changes when content is published and rebuilt, so when a publish does not propagate, the canary the monitor reads is still the old one, and the monitor goes red.&lt;/p&gt;

&lt;p&gt;There is one landmine. If you hardcode the exact current canary value into your monitor and walk away, the next legitimate publish changes the value and your monitor goes red on a perfectly healthy deploy. The fix is to assert on the canary's shape rather than its frozen literal, or to refresh the monitor's expected value from your deploy step. We do not re-derive that here; the freshness toolkit, including refreshing a monitor's expected value with a &lt;code&gt;PUT /api/checks/&amp;lt;id&amp;gt;&lt;/code&gt; as part of your deploy, lives in our &lt;a href="https://velprove.com/blog/monitor-nextjs-app-production" rel="noopener noreferrer"&gt;Next.js production monitoring guide&lt;/a&gt;.&lt;/p&gt;
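&lt;p&gt;The two assertion styles can be sketched in a few lines of Python. The &lt;code&gt;x-publish-id&lt;/code&gt; meta name and the identifier format below are hypothetical, so adjust both regexes to your own canary. A shape-only check avoids the false red on every publish, but it only proves that some well-formed canary is present. A stale copy also has a well-formed canary. Comparing against an expected value that your deploy step keeps current is what catches staleness.&lt;/p&gt;

```python
import re
from typing import Optional

# Hypothetical canary: name="x-publish-id" content="20260606-1402-ab12cd"
META_RE = re.compile(r'name="x-publish-id"\s+content="([^"]+)"')
SHAPE_RE = re.compile(r"\d{8}-\d{4}-[0-9a-f]{6}")  # date-time-shortsha shape


def read_canary(html: str) -> Optional[str]:
    """Pull the canary value out of the rendered page, if present."""
    match = META_RE.search(html)
    return match.group(1) if match else None


def canary_ok(html: str, expected: Optional[str] = None) -> bool:
    """Shape check always; exact match only when an expected value is known."""
    value = read_canary(html)
    if value is None or not SHAPE_RE.fullmatch(value):
        return False
    return expected is None or value == expected
```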

&lt;h3&gt;
  
  
  Where the canary lives
&lt;/h3&gt;

&lt;p&gt;The canary lives in your infrastructure, not Velprove's, so this is one place where showing your own code is fair game. The cheapest version is a meta tag your build writes with the publish identifier the CMS gives you:&lt;/p&gt;
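&lt;p&gt;Here is a minimal Python sketch of such a build step. It assumes your HTML template contains a placeholder string where the tag should go. The &lt;code&gt;%%PUBLISH_META%%&lt;/code&gt; token and the &lt;code&gt;x-publish-id&lt;/code&gt; name are hypothetical choices. The publish id would come from your CMS webhook payload or build environment.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder that your HTML template carries inside its head.
PLACEHOLDER = "%%PUBLISH_META%%"


def publish_meta_tag(publish_id: str) -> str:
    """Serialize the canary meta tag; ElementTree escapes the attribute value."""
    tag = ET.Element("meta", {"name": "x-publish-id", "content": publish_id})
    return ET.tostring(tag, encoding="unicode")


def stamp_template(template: str, publish_id: str) -> str:
    """Build step: replace the placeholder with this build's canary tag."""
    if PLACEHOLDER not in template:
        raise ValueError("template has no publish-id placeholder")
    return template.replace(PLACEHOLDER, publish_meta_tag(publish_id))
```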

&lt;p&gt;Or, if you prefer a dedicated endpoint your monitor can hit without parsing HTML, a tiny canary route that returns the currently-deployed content version:&lt;/p&gt;
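&lt;p&gt;Here is a standard-library Python sketch of that route. It assumes a hypothetical &lt;code&gt;PUBLISH_ID&lt;/code&gt; environment variable that your build sets. In practice the route would live inside whatever framework already serves your site.&lt;/p&gt;

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical: the build step exports the publish id it deployed.
DEPLOYED_PUBLISH_ID = os.environ.get("PUBLISH_ID", "unknown")


def canary_payload(publish_id: str) -> bytes:
    """JSON body reporting which content version this deploy is serving."""
    return json.dumps({"publishId": publish_id}).encode("utf-8")


class CanaryHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path.split("?", 1)[0] != "/canary":
            self.send_error(404)
            return
        body = canary_payload(DEPLOYED_PUBLISH_ID)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the canary route standalone (blocks until interrupted)."""
    HTTPServer(("0.0.0.0", port), CanaryHandler).serve_forever()
```

&lt;p&gt;Whether this route sits behind the same CDN cache as your pages is a design choice. If it is cached like your pages, it can reveal an edge that has not purged. If it bypasses the cache, it proves the deploy but not the purge.&lt;/p&gt;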

&lt;p&gt;Either shape gives the monitor something that changes on every publish. Pointing the monitor's assertion at it is covered in the setup section below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contentful
&lt;/h2&gt;

&lt;p&gt;Contentful fires webhooks for publish events, and the retry behavior is the first thing that bites. A webhook delivery is retried up to 3 times over roughly one minute, after which it is considered failed. A request that takes longer than the 30-second timeout is treated as failed with no retry at all. Those failures do not page you. They land in a pull-only activity log that you have to go look at, and that log is capped at a maximum of 500 entries, so a busy account can roll the evidence of a failure off the end before anyone notices.&lt;/p&gt;

&lt;p&gt;On the read side, Contentful serves its Content Delivery API from a CDN cache. That cache is why the API is fast, and it is also why an edge node can serve a stale copy of an entry after you publish. We are not going to put a number on Contentful staleness, because Contentful does not publish one and inventing a figure would be worse than useless. The shape of the risk is what matters: a successful publish plus a delivered webhook plus a cached edge can still mean a visitor in one region sees yesterday's content. A canary assertion on the live URL, run from multiple regions, is how you see that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sanity
&lt;/h2&gt;

&lt;p&gt;Sanity is the cleanest illustration of published not equalling what the edge serves, because it splits the two into different hostnames. GROQ webhooks retry twice with a 30-second timeout per attempt, so a slow or flaky deploy hook gets a short, finite number of chances and then the delivery is done. On the read side, &lt;code&gt;api.sanity.io&lt;/code&gt; returns fresh, uncached results, while the cached API at &lt;code&gt;apicdn.sanity.io&lt;/code&gt; serves from a cache. Sanity documents that if Content Lake is unavailable, the cached API can keep serving last-cached content for up to two hours.&lt;/p&gt;

&lt;p&gt;That split is the whole lesson in two hostnames. If your front end reads from the cached host for speed, a publish can land in Content Lake and your visitors can still see the previous content for the cached window. Querying the fresh host proves the data is in Sanity. Probing your actual live URL proves what your visitors get, which is the thing that matters. The canary tells you which of the two you are actually serving.&lt;/p&gt;
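&lt;p&gt;To see the split for yourself, you can build the same GROQ query against both hosts. This sketch assumes Sanity's HTTP query endpoint shape. The project id, dataset, query, and API version date are placeholders, so check Sanity's docs for a current version string.&lt;/p&gt;

```python
from urllib.parse import quote, urlencode


def sanity_query_url(project_id: str, dataset: str, groq: str,
                     api_version: str = "v2021-10-21", cached: bool = True) -> str:
    """Same query, two hosts: apicdn.sanity.io (cached) or api.sanity.io (fresh)."""
    host = "apicdn.sanity.io" if cached else "api.sanity.io"
    query = urlencode({"query": groq}, quote_via=quote)  # spaces become %20
    return f"https://{project_id}.{host}/{api_version}/data/query/{dataset}?{query}"
```

&lt;p&gt;Comparing the two responses shows whether the cached host has caught up with Content Lake. It still says nothing about what your own front end renders. That is why the live-URL canary remains the probe that counts.&lt;/p&gt;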

&lt;h2&gt;
  
  
  Strapi
&lt;/h2&gt;

&lt;p&gt;Strapi is the other half of the split, and it fails differently. Strapi is self-hosted, so there are two consequences that Contentful and Sanity do not have. First, Strapi has no built-in webhook retries. Its docs tell you to build retry logic yourself, which means a failed deploy-hook delivery is a single dropped event unless you wrote the retry. Second, Strapi ships no vendor CDN, so edge staleness only exists if you added your own cache in front of it.&lt;/p&gt;
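&lt;p&gt;Because Strapi leaves retries to you, one option is a small relay. Point the Strapi webhook at a service you control, and have that service call your deploy hook with backoff. This is a Python sketch. The relay itself, the attempt count, and the delays are hypothetical choices, not Strapi features.&lt;/p&gt;

```python
import time
import urllib.request
from typing import Callable


def post_deploy_hook(url: str) -> int:
    """Fire the deploy hook once; returns the HTTP status, raises on failure."""
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


def fire_with_retries(send: Callable[[], int], attempts: int = 4,
                      base_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Retry a deploy-hook call with exponential backoff: 2s, 4s, 8s, ..."""
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except OSError:  # HTTPError, URLError and timeouts
            if attempt == attempts:
                raise  # out of attempts: surface the failure, do not swallow it
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("attempts must be at least 1")
```

&lt;p&gt;A relay would call something like &lt;code&gt;fire_with_retries(lambda: post_deploy_hook(hook_url))&lt;/code&gt; in the background after acknowledging Strapi quickly. The final &lt;code&gt;raise&lt;/code&gt; matters: a relay that swallows the last failure just moves the silent drop one hop downstream.&lt;/p&gt;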

&lt;p&gt;The Strapi failure that bites is the deploy. The webhook Strapi fires triggers a build or deploy on your side, and that deploy can fail while Strapi reports a clean publish and a delivered webhook. And as issue &lt;a href="https://github.com/strapi/strapi/issues/22884" rel="noopener noreferrer"&gt;#22884&lt;/a&gt; shows, even the published API itself can return stale or dropped relations after a publish, with no warning from the CMS. So the split is this: with Contentful and Sanity the common failure is the edge serving stale content; with Strapi the common failure is the triggered deploy failing or the published API returning stale relations.&lt;/p&gt;

&lt;p&gt;Strapi does give you one clean probe of the server itself. Its &lt;code&gt;/_health&lt;/code&gt; endpoint returns HTTP &lt;code&gt;204&lt;/code&gt; with a header of &lt;code&gt;strapi: You are so French!&lt;/code&gt;. Both the status code and the header are assertable, so a monitor can confirm the Strapi server is up and is genuinely Strapi, separate from confirming that your live site shows the new content. The health probe tells you the CMS is alive. The canary probe tells you the publish reached the visitor. You want both.&lt;/p&gt;
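&lt;p&gt;Both assertions are easy to express in code. Here is a standard-library sketch. The base URL you pass in is your own Strapi server.&lt;/p&gt;

```python
import urllib.request


def is_strapi_health(status: int, headers) -> bool:
    """Both assertions from the text: status 204 and the strapi header."""
    return status == 204 and "You are so French!" in (headers.get("strapi") or "")


def check_strapi(base_url: str) -> bool:
    """Probe /_health; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + "/_health", timeout=10) as resp:
            # resp.headers.get is case-insensitive, so "Strapi" also matches
            return is_strapi_health(resp.status, resp.headers)
    except OSError:
        return False
```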

&lt;h2&gt;
  
  
  How is this different from monitoring a Next.js app or a landing page?
&lt;/h2&gt;

&lt;p&gt;Fair question, because the surfaces overlap and the wrong post wastes your time. The boundary is about what failure you are hunting.&lt;/p&gt;

&lt;p&gt;If you want the general freshness toolkit for a production app, including how to keep an assertion value current across deploys, that is &lt;a href="https://velprove.com/blog/monitor-nextjs-app-production" rel="noopener noreferrer"&gt;monitoring a Next.js app in production&lt;/a&gt;. If you just want to assert that a page renders the right copy and images, with generic body assertions, that is &lt;a href="https://velprove.com/blog/monitor-landing-page-content" rel="noopener noreferrer"&gt;monitoring landing page content&lt;/a&gt;. If you want a heartbeat that returns a 503 when your app knows it is serving stale data, that is the pattern in &lt;a href="https://velprove.com/blog/api-health-check-patterns" rel="noopener noreferrer"&gt;API health check patterns&lt;/a&gt;. And if the question is whether your front-end host itself deployed and is reachable, that is hosting, covered in &lt;a href="https://velprove.com/blog/monitor-vercel-hosted-site" rel="noopener noreferrer"&gt;monitoring a Vercel-hosted site&lt;/a&gt; and &lt;a href="https://velprove.com/blog/monitor-cloudflare-workers-pages-site" rel="noopener noreferrer"&gt;monitoring a Cloudflare Workers or Pages site&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This post owns one specific thing the others do not: the CMS-to-live-URL publish path, where the CMS says published, the host says deployed, and the visitor still sees stale content because a link between them broke. The content canary is the assertion built for exactly that gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Set this up in Velprove
&lt;/h2&gt;

&lt;p&gt;You build all of this in the wizard, with no code on the Velprove side. Everything here is available on every plan, including the free one: no-code browser login monitors, multi-step API monitors of up to 3 steps, and a choice of 5 monitoring regions. The publish-path monitors below all fit inside the free plan, with commercial use allowed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The content-canary monitor.&lt;/strong&gt; Create an HTTP monitor pointed at your live production URL. Step one: GET the live URL. Step two: assert that the response body contains your canary token, the same &lt;code&gt;x-publish-id&lt;/code&gt; value or canary string your build emits. That single assertion proves the entire publish path ran, because the token only appears once the new build is live at the edge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Strapi health monitor.&lt;/strong&gt; For a self-hosted Strapi server, create a second HTTP monitor against &lt;code&gt;/_health&lt;/code&gt; and assert the status code equals &lt;code&gt;204&lt;/code&gt; and that the response header &lt;code&gt;strapi&lt;/code&gt; contains &lt;code&gt;You are so French!&lt;/code&gt;. That confirms the Strapi process is up and is genuinely Strapi, separate from whether your live site reflects the latest publish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The optional multi-step chain.&lt;/strong&gt; If you want to drive a publish and then prove it propagated, a 3-step multi-step API monitor does it. Step one: call your own publish-trigger endpoint and capture the returned publish id into a variable. Step two: wait. Step three: GET your own canary endpoint with that captured id in the URL, for example &lt;code&gt;GET https://yoursite.com/canary?expect={{publishId}}&lt;/code&gt;, and have that endpoint compare the live content against the id and return &lt;code&gt;200&lt;/code&gt; when it matches or a non-2xx when it does not. Velprove then asserts step three's status code equals &lt;code&gt;200&lt;/code&gt;. The captured id flows into the request URL, where &lt;code&gt;{{publishId}}&lt;/code&gt; is interpolated for real, and the comparison runs server-side on your own infrastructure. That chain proves not just that the site is up, but that a publish you triggered actually reached the edge.&lt;/p&gt;

&lt;p&gt;The canary endpoint that does the comparison is your code, so here is the shape of it:&lt;/p&gt;
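
&lt;p&gt;A minimal sketch in Python's standard library, not a drop-in implementation. It assumes the canary runs as its own small service, that &lt;code&gt;LIVE_URL&lt;/code&gt; (a hypothetical placeholder) is the production page carrying your publish id, and that ids are alphanumeric. It answers &lt;code&gt;200&lt;/code&gt; when the id is live and &lt;code&gt;409&lt;/code&gt; (an arbitrary non-2xx choice) when it is not.&lt;/p&gt;

```python
import re
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

LIVE_URL = "https://yoursite.com/"  # hypothetical: the production page that carries the publish id


def canary_status(expected_id: str, live_body: str) -> int:
    """200 if the live content carries expected_id, 409 if not, 400 if no id was supplied."""
    if not expected_id:
        return 400
    # Word boundaries stop an id like "abc12" from matching inside "abc123".
    pattern = r"\b" + re.escape(expected_id) + r"\b"
    return 200 if re.search(pattern, live_body) else 409


class CanaryHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        parsed = urllib.parse.urlsplit(self.path)
        if parsed.path != "/canary":
            self.send_error(404)
            return
        expected = urllib.parse.parse_qs(parsed.query).get("expect", [""])[0]
        try:
            with urllib.request.urlopen(LIVE_URL, timeout=5) as resp:
                live_body = resp.read().decode("utf-8", errors="replace")
        except OSError:
            # The live page could not be read at all; report that, never a match.
            self.send_error(502)
            return
        status = canary_status(expected, live_body)
        body = b"live\n" if status == 200 else b"not live\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Cache-Control", "no-store")  # never let a CDN cache the verdict
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the canary service until the process is stopped."""
    HTTPServer(("0.0.0.0", port), CanaryHandler).serve_forever()
```

&lt;p&gt;Call &lt;code&gt;serve()&lt;/code&gt; to run it. Because this sketch fetches &lt;code&gt;LIVE_URL&lt;/code&gt; from wherever the canary service runs, it sees a single vantage point; the plain content-canary monitor run from several regions is what covers geographic staleness.&lt;/p&gt;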

&lt;p&gt;Run any of these from multiple regions so a CDN edge that serves stale content in one geography cannot hide behind four healthy ones. Each monitor runs from a single one of the 5 regions, so create one canary monitor per region. With up to 10 monitors on the free plan, five regional canaries fit with room to spare.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Set up a free headless CMS publish-path monitor with Velprove&lt;/a&gt; . Five regions, no credit card, commercial use allowed. Probe the live URL, assert the canary, and find out a publish did not propagate before your readers do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does my CMS tell me when a build or deploy fails after I publish?
&lt;/h3&gt;

&lt;p&gt;No. Your CMS confirms that it published the entry and, at best, logs that it attempted to deliver a webhook to your build or deploy hook. It will not proactively tell you the build failed, and the log only covers webhook delivery, never whether content actually went live on your production URL. Contentful, Sanity, and Strapi all stop caring once the entry is published and the webhook attempt is recorded. The only thing that proves the whole chain is an external prober that hits the live URL and asserts on what it gets back.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a content canary, and why assert on it instead of a fixed string?
&lt;/h3&gt;

&lt;p&gt;A content canary is a per-publish value that your build emits into the rendered page, for example a &lt;code&gt;data-publish-id&lt;/code&gt; attribute, a meta tag, or a token returned by a small canary endpoint you control. A monitor asserts that this value is present and, ideally, that it changed after a publish. A fixed string assertion only proves the page is rendering at all. It cannot tell a fresh deploy from a stale cached copy that still contains the same words. The canary distinguishes published from live because it changes per build, so when a publish does not propagate, the canary the monitor sees is the old one.&lt;/p&gt;
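
&lt;p&gt;Here is a sketch of one variant of the build side, in Python, assuming your build can run a small script and that a static &lt;code&gt;canary.txt&lt;/code&gt; file (a hypothetical name) ships with the rest of the output. A &lt;code&gt;data-publish-id&lt;/code&gt; attribute or meta tag works the same way; the id just lives inside the page instead. The last function shows the monitor-side comparison against the id that was live before the publish.&lt;/p&gt;

```python
import pathlib
import uuid


def make_publish_id() -> str:
    """A new random id for every build."""
    return uuid.uuid4().hex


def write_canary(out_dir: str, publish_id: str) -> pathlib.Path:
    """Write the id into the build output as canary.txt so it deploys with the pages."""
    path = pathlib.Path(out_dir) / "canary.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(publish_id + "\n", encoding="utf-8")
    return path


def classify_canary(previous_id: str, live_body: str) -> str:
    """Compare what the live URL serves now with the id that was live before the publish."""
    current = live_body.strip()
    if not current:
        return "missing"  # no canary at all: the build or deploy is broken
    if current == previous_id:
        return "stale"    # still the old build: the publish did not propagate
    return "fresh"        # a new id is live
```

&lt;p&gt;The id is random per build on purpose: a content-only rebuild triggered by the CMS usually reuses the same commit, so a commit SHA would not change between publishes. A standalone file can also be cached differently from your pages, so embedding the id in the page itself is the stronger signal when you can.&lt;/p&gt;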

&lt;h3&gt;
  
  
  How often does a headless CMS serve stale content from its CDN after publishing?
&lt;/h3&gt;

&lt;p&gt;There is no single published number that applies to every CMS. Contentful serves its Content Delivery API from a CDN cache, so an edge node can return a stale copy after a publish, but Contentful does not publish a staleness figure. Sanity documents that its cached API endpoint can keep serving last-cached content for up to two hours if Content Lake is unavailable. Strapi has no vendor CDN, so staleness only exists if you put your own cache in front of it. The honest answer is that the window varies by provider, by endpoint, and by your own caching, which is exactly why you measure it from outside with a canary instead of trusting a documented number.&lt;/p&gt;
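
&lt;p&gt;To measure that window yourself, poll the live URL after a publish until the new id appears and record how long it took. A minimal Python sketch, assuming you pass in a &lt;code&gt;fetch_body&lt;/code&gt; callable that returns the live page as text; the 15-second interval and 30-minute timeout are arbitrary defaults, and the injectable clock exists only so the function can be tested without waiting.&lt;/p&gt;

```python
import time
from typing import Callable, Optional


def time_to_propagate(
    fetch_body: Callable[[], str],
    expected_id: str,
    interval_s: float = 15.0,
    timeout_s: float = 30 * 60,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until expected_id appears in the live body.

    Returns the seconds it took, or None if the id never showed up before timeout_s.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if expected_id in fetch_body():
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(interval_s)
```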

&lt;h3&gt;
  
  
  Can I monitor a self-hosted Strapi instance the same way as Contentful or Sanity?
&lt;/h3&gt;

&lt;p&gt;Mostly yes, with one extra probe. The content-canary assertion on your live URL works identically for Strapi. The difference is that Strapi is self-hosted, has no vendor CDN, and does not retry webhooks for you, so the deploy its webhook triggers can fail and Strapi will not tell you. Strapi also exposes a health endpoint at &lt;code&gt;/_health&lt;/code&gt; that returns HTTP &lt;code&gt;204&lt;/code&gt; with a &lt;code&gt;strapi: You are so French!&lt;/code&gt; header, both of which are assertable, so you can probe that the Strapi server itself is alive separately from probing that your live site shows the new content.&lt;/p&gt;

&lt;h3&gt;
  
  
  My monitor's assertion value goes stale every publish. How do I keep it current?
&lt;/h3&gt;

&lt;p&gt;Assert on the canary's shape rather than its literal value, or refresh the monitor's expected value programmatically from your deploy step. Hardcoding the exact published value and walking away guarantees a red monitor on the next legitimate publish. Our &lt;a href="https://velprove.com/blog/monitor-nextjs-app-production" rel="noopener noreferrer"&gt;Next.js production monitoring guide&lt;/a&gt; covers the freshness toolkit in full, including how to update a monitor's expected value with a &lt;code&gt;PUT /api/checks/&amp;lt;id&amp;gt;&lt;/code&gt; as part of your deploy. Use that pattern rather than re-deriving it here.&lt;/p&gt;
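
&lt;p&gt;What asserting on shape means, as a minimal Python sketch. It assumes the canary is served on its own, for example as a plain-text canary file, and is a 32-character lowercase hex id such as the output of &lt;code&gt;uuid.uuid4().hex&lt;/code&gt;; change the pattern to match whatever your build actually emits.&lt;/p&gt;

```python
import re

# Assumed format: 32 lowercase hex characters, e.g. uuid.uuid4().hex. Adjust to your build.
CANARY_SHAPE = re.compile(r"[0-9a-f]{32}")


def canary_has_valid_shape(live_body: str) -> bool:
    """True when the body is exactly one well-formed canary id (surrounding whitespace ignored)."""
    return CANARY_SHAPE.fullmatch(live_body.strip()) is not None
```

&lt;p&gt;The trade-off is deliberate: a shape check never goes red on a legitimate publish, but it also cannot tell a stale id from a fresh one, because both have the same shape. It proves the canary is present and well-formed, not that the latest publish is live; for freshness, refresh the expected value from your deploy step.&lt;/p&gt;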

&lt;h3&gt;
  
  
  Do I need a paid plan to monitor my CMS publish path from multiple regions?
&lt;/h3&gt;

&lt;p&gt;No. A Velprove monitor runs from one of 5 regions, and you create one monitor per region; the free plan allows up to 10 monitors, so you can cover all five geographies without paying. The free plan also includes multi-step API monitors of up to 3 steps and no-code browser login monitors, and it allows commercial use. A content-canary monitor on your live URL, the Strapi &lt;code&gt;/_health&lt;/code&gt; probe, and a small trigger-then-poll multi-step chain all fit inside the free plan.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>webdev</category>
      <category>devops</category>
      <category>uptime</category>
    </item>
    <item>
      <title>Monitor a Shopify App You Built: Webhooks and OAuth</title>
      <dc:creator>velprove</dc:creator>
      <pubDate>Fri, 05 Jun 2026 14:00:02 +0000</pubDate>
      <link>https://dev.to/velprove/monitor-a-shopify-app-you-built-webhooks-and-oauth-31hk</link>
      <guid>https://dev.to/velprove/monitor-a-shopify-app-you-built-webhooks-and-oauth-31hk</guid>
      <description>&lt;p&gt;** The quick version: To monitor a Shopify app you built, run a no-code API monitor on your webhook receiver and a second one on your app's API surface, from multiple regions, because a browser monitor cannot script the per-shop install flow. Shopify gives your webhook receiver a one-second connection timeout and a five-second total budget, and only a &lt;code&gt;200&lt;/code&gt; counts as an acknowledgement. Miss that window and Shopify retries a failed webhook 8 times over roughly 4 hours, then keeps going on its own schedule while your deliveries quietly drop on the floor. Shopify never pages you about it, and it publishes no uptime SLA for your app, so the merchant who built a workflow on your app finds out first, then tells you. Velprove probes the endpoints Shopify will hit and the fields your app depends on from 5 regions, free to start and no-code to set up. It does not complete the Shopify OAuth or install consent, because that is a per-shop, merchant-authenticated grant no synthetic login can replay. **&lt;/p&gt;

&lt;h2&gt;
  
  
  Are you the merchant, or the developer who built the app?
&lt;/h2&gt;

&lt;p&gt;First, make sure you are on the right page. If you run a Shopify business and you are worried about whether buyers can complete a purchase, you want the merchant guides, not this one. Start with &lt;a href="https://velprove.com/blog/monitor-shopify-store-uptime" rel="noopener noreferrer"&gt;how to monitor your Shopify store uptime&lt;/a&gt; for the front-of-house reachability angle, and &lt;a href="https://velprove.com/blog/monitor-shopify-checkout-flow" rel="noopener noreferrer"&gt;monitoring the checkout path&lt;/a&gt; for the buy-side flow. Both of those, plus the segment overview at &lt;a href="https://velprove.com/for/shopify" rel="noopener noreferrer"&gt;Velprove for Shopify&lt;/a&gt;, are written for the person running the storefront.&lt;/p&gt;

&lt;p&gt;This post is for the other person. You built an embedded Shopify admin app and the webhook consumers behind it. You own a receiver that Shopify POSTs to, an API surface your app calls, and an OAuth install flow each merchant runs once. None of that shows up on a storefront status page, and none of it is covered by a merchant uptime check. When your app breaks, it breaks in a place only you are responsible for, and the platform that hosts you will not tell you it happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually breaks in an app you built (and Shopify never tells you)
&lt;/h2&gt;

&lt;p&gt;Two surfaces in an app you built fail silently, and both of them sit outside anything Shopify reports on your behalf.&lt;/p&gt;

&lt;p&gt;The first is your webhook receiver. Shopify sends a topic like an order creation or an app uninstall as an HTTPS POST to a URL you registered. If that URL starts answering slowly, returning a redirect, throwing a 500, or rejecting valid signatures after a deploy, Shopify treats the delivery as failed and moves into a retry path. From the outside your process looks alive. The web server is up, the health check is green, and yet the deliveries that drive your app are not landing.&lt;/p&gt;

&lt;p&gt;The second is your app's own API surface, the calls your embedded admin UI makes and the Admin API requests your backend issues. A schema change, a rotated credential, or a retired API version can change what comes back without changing the status code. The page renders, the request returns &lt;code&gt;200&lt;/code&gt;, and a field your code relied on is suddenly missing or shaped differently.&lt;/p&gt;

&lt;p&gt;Here is the uncomfortable part: Shopify does not publish a numeric uptime SLA for your app. It reviews you for things like HMAC verification of webhooks and for performance, including a guideline that your app should not drop Lighthouse performance scores by more than ten points, but it does not promise or measure your availability for you. Reachability after you go live is your problem. The stakes are not abstract: when a delivery silently fails, the merchant who depends on that automation usually notices before you do, and you hear about your own outage as a support ticket. Monitoring exists to flip that order.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Shopify delivers webhooks: the 5-second budget
&lt;/h2&gt;

&lt;p&gt;The webhook mechanics are documented precisely, and they are the spine of the whole monitoring argument. Shopify's &lt;a href="https://shopify.dev/docs/apps/build/webhooks/subscribe/https" rel="noopener noreferrer"&gt;HTTPS webhook delivery documentation&lt;/a&gt; spells out the timing your receiver lives inside.&lt;/p&gt;

&lt;p&gt;Shopify delivers each webhook as an HTTPS POST. It allows a one-second connection timeout and a five-second timeout for the entire request. The only response that counts as a successful acknowledgement is a &lt;code&gt;200&lt;/code&gt;. Anything outside the 200 range is treated as an error, and that explicitly includes &lt;code&gt;3xx&lt;/code&gt; redirects. A receiver that 302s to a canonical host, or sits behind a redirect you forgot about, is failing every delivery even though a browser would follow the redirect without complaint.&lt;/p&gt;
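
&lt;p&gt;The acknowledgement rule fits in a few lines. A minimal Python sketch that mirrors the limits above, useful as a reference when you pick a monitor threshold; the function name is hypothetical, and it treats a response landing exactly on a limit as in time.&lt;/p&gt;

```python
# Limits taken from the delivery rules above.
CONNECT_TIMEOUT_S = 1.0  # time allowed to establish the connection
TOTAL_TIMEOUT_S = 5.0    # time allowed for the entire request


def shopify_would_ack(status: int, connect_s: float, total_s: float) -> bool:
    """Would Shopify count this response as an acknowledged delivery?"""
    if connect_s > CONNECT_TIMEOUT_S or total_s > TOTAL_TIMEOUT_S:
        return False  # too slow: the delivery counts as failed and goes into the retry path
    # Only a 200 counts. A 301 or 302 fails even though a browser would follow it.
    return status == 200
```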

&lt;p&gt;When Shopify receives no response or an error, it retries 8 times over the next 4 hours. Those are the documented figures, so plan around a four-hour retry window, not days. During those four hours your deliveries are queued and re-attempted, which means a receiver that is slow or erroring is dropping real events the entire time, not just at the moment it broke.&lt;/p&gt;

&lt;p&gt;There is a deletion consequence, but it is narrower than the scary version you have probably heard. After 8 consecutive failures, Shopify automatically deletes the subscription only if it was configured using the Admin API, meaning a shop-specific subscription. App-specific subscriptions declared in your app configuration are not deleted by Shopify. So the blanket claim that Shopify deletes your webhook after failures is wrong for config-declared subscriptions. The thing that is always true, regardless of subscription type, is that during the failure window deliveries are not being processed by your app, and the first human to notice is usually the merchant, not you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why HMAC verification makes your receiver fragile
&lt;/h2&gt;

&lt;p&gt;Every Shopify webhook arrives signed, and the signature is where a lot of receivers quietly break. Each HTTPS delivery includes a base64-encoded signature in the &lt;code&gt;X-Shopify-Hmac-Sha256&lt;/code&gt; header. Your handler is expected to recompute it: take the HMAC-SHA256 of the &lt;strong&gt;raw request body&lt;/strong&gt; using your app's client secret, base64-encode the result, and compare it against the header value with a constant-time comparison. If they match, the delivery genuinely came from Shopify. If they do not, the correct response is a &lt;code&gt;401&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fragility is in the details. The HMAC must be computed over the raw, unparsed body, which means any body-parsing middleware that runs before verification corrupts the bytes and breaks the comparison. The secret is your app client secret, also called the API secret key, not an access token and not a separate signing key, so a credential rotation that updates one place and not the other turns every valid delivery into a rejected one. A subtle bug here does not crash your app. It turns a healthy-looking endpoint into a silent &lt;code&gt;401&lt;/code&gt; machine that Shopify reads as a string of failures.&lt;/p&gt;
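
&lt;p&gt;A minimal verifier sketch in Python's standard library, following the steps above: HMAC-SHA256 over the raw bytes with your app client secret, base64-encode, constant-time compare. How you obtain the raw body and the header value depends on your framework and is assumed here; the key point is to read the body as bytes before any JSON parsing.&lt;/p&gt;

```python
import base64
import hashlib
import hmac


def verify_shopify_webhook(raw_body: bytes, hmac_header: str, client_secret: str) -> bool:
    """Check X-Shopify-Hmac-Sha256 against the raw, unparsed request body."""
    digest = hmac.new(client_secret.encode("utf-8"), raw_body, hashlib.sha256).digest()
    expected = base64.b64encode(digest)
    # compare_digest takes the same time whether the values differ early or late.
    return hmac.compare_digest(expected, (hmac_header or "").encode("utf-8"))


def status_for(raw_body: bytes, hmac_header: str, client_secret: str) -> int:
    """200 acknowledges a genuine delivery; 401 rejects one whose signature does not match."""
    return 200 if verify_shopify_webhook(raw_body, hmac_header, client_secret) else 401
```

&lt;p&gt;To see the middleware trap in action, sign a body that contains a double space, then verify &lt;code&gt;json.dumps(json.loads(raw))&lt;/code&gt; instead of the raw bytes: the payload means the same thing, but the bytes differ, so the check fails and a real handler would answer &lt;code&gt;401&lt;/code&gt;.&lt;/p&gt;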

&lt;p&gt;This is also why the compliance side matters. Every app distributed through the Shopify App Store must answer the mandatory compliance webhooks, &lt;code&gt;customers/data_request&lt;/code&gt;, &lt;code&gt;customers/redact&lt;/code&gt;, and &lt;code&gt;shop/redact&lt;/code&gt;, even if your app stores no customer data. Those handlers are held to the same bar: acknowledge with a 200-series status, return &lt;code&gt;401&lt;/code&gt; on a bad HMAC. They are real routes that must stay alive, so they belong in your monitoring too. Shopify webhook HMAC signing is its own scheme, separate from how other providers sign; if you also handle &lt;a href="https://velprove.com/blog/monitor-stripe-webhooks" rel="noopener noreferrer"&gt;Stripe webhook signatures&lt;/a&gt;, the idea is similar but the header names and retry behavior differ, so do not assume one verifier covers both.&lt;/p&gt;

&lt;p&gt;For monitoring, the honest boundary is this. A monitor cannot stand in for your HMAC verifier, and it should not pretend to. What it can do is confirm the endpoint Shopify will hit is answering correctly and fast enough, on a route you control, which is the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitor the webhook receiver: assert 200 inside Shopify's window
&lt;/h2&gt;

&lt;p&gt;The right tool for the receiver is an API monitor that calls the endpoint Shopify will call and asserts on what comes back. Velprove probes the endpoint from five regions and checks the things that map directly to Shopify's delivery rules.&lt;/p&gt;

&lt;p&gt;Point the monitor at a health or canary route on your receiver, not at a real webhook topic that mutates data, and assert three things. Assert the status code equals &lt;code&gt;200&lt;/code&gt;, because that is the only response Shopify accepts and the only one that proves your route is the live handler rather than a stale page or a redirect. Assert the response time is under a threshold you tune below Shopify's five-second total budget, so you find out the receiver is getting slow before it starts timing out under real load. Assert a body marker or a response header you control, so a &lt;code&gt;200&lt;/code&gt; from a generic 404 page or a load-balancer splash does not pass as healthy.&lt;/p&gt;
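
&lt;p&gt;If you want to reproduce those three assertions from your own machine, here is a minimal Python sketch. It refuses redirects on purpose, because Python's &lt;code&gt;urllib&lt;/code&gt; follows them by default and Shopify does not, so a 3xx shows up as a failure instead of being silently followed. The 2-second threshold and the &lt;code&gt;receiver-ok&lt;/code&gt; marker are hypothetical values; tune the threshold below the five-second budget.&lt;/p&gt;

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass, field


class _RefuseRedirects(urllib.request.HTTPRedirectHandler):
    # Returning None makes urllib raise the 3xx as an HTTPError instead of following it,
    # which matches Shopify: a redirect is a failed delivery, not a detour.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


_OPENER = urllib.request.build_opener(_RefuseRedirects)


@dataclass
class ProbeResult:
    status: int  # 0 means no HTTP response at all (DNS failure, refused connection, timeout)
    elapsed_s: float
    body: str = ""
    headers: dict = field(default_factory=dict)


def probe(url: str, timeout_s: float = 5.0) -> ProbeResult:
    """GET the receiver's health route once, timing the full request."""
    start = time.monotonic()
    try:
        with _OPENER.open(url, timeout=timeout_s) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return ProbeResult(resp.status, time.monotonic() - start, body, dict(resp.headers.items()))
    except urllib.error.HTTPError as err:
        # 3xx (redirects are refused), 4xx and 5xx arrive here with their real status code.
        return ProbeResult(err.code, time.monotonic() - start)
    except OSError:
        return ProbeResult(0, time.monotonic() - start)


def receiver_is_healthy(result: ProbeResult, marker: str = "receiver-ok", threshold_s: float = 2.0) -> bool:
    """The three assertions: exactly 200, faster than the threshold, and the marker is present."""
    return result.status == 200 and threshold_s > result.elapsed_s and marker in result.body
```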

&lt;p&gt;Velprove can assert on the body content of the response, for example that a known marker string is present or that an error string is absent, and it can assert on the response headers your endpoint returns. That header assertion is useful for confirming your receiver answers with the headers you expect, for instance a content type or a custom marker header you set on the health route. To be precise about the boundary: Velprove asserts on the values your endpoint returns. It does not compute or send a valid &lt;code&gt;X-Shopify-Hmac-Sha256&lt;/code&gt; signature, and it does not deliver a signed test payload your verifier would accept. It probes the route and checks the answer.&lt;/p&gt;
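
&lt;p&gt;On the receiver side, the health route only needs to answer fast, never redirect, and carry a marker the monitor can check. A minimal sketch using Python's built-in &lt;code&gt;http.server&lt;/code&gt;; the &lt;code&gt;/health&lt;/code&gt; path, the &lt;code&gt;X-Receiver-Health&lt;/code&gt; header, and its value are hypothetical names, and in a real app this route should live in the same process as your webhook handler so a green check means that process is up.&lt;/p&gt;

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

MARKER_HEADER = "X-Receiver-Health"   # hypothetical header name
MARKER_VALUE = "shopify-receiver-ok"  # hypothetical value the monitor asserts on
HEALTH_BODY = b"receiver-ok\n"        # body marker, for a body-content assertion


class ReceiverHealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        # Answer directly: no redirect, no auth wall, nothing slow in front of it.
        self.send_response(200)
        self.send_header(MARKER_HEADER, MARKER_VALUE)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(HEALTH_BODY)))
        self.send_header("Cache-Control", "no-store")
        self.end_headers()
        self.wfile.write(HEALTH_BODY)


def start_in_background(host: str = "127.0.0.1", port: int = 0) -> HTTPServer:
    """Run the route on a daemon thread; port 0 asks the OS for a free port."""
    server = HTTPServer((host, port), ReceiverHealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def check_locally(server: HTTPServer) -> tuple:
    """Fetch /health from a server started on 127.0.0.1, bypassing any configured proxy."""
    url = f"http://127.0.0.1:{server.server_address[1]}/health"
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as resp:
        return resp.status, resp.headers.get(MARKER_HEADER), resp.read()
```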

&lt;p&gt;Two limits to be honest about. Velprove does not receive inbound webhooks; there is no passive receiver waiting for Shopify to POST to Velprove, so it cannot confirm a real delivery landed. It probes the endpoint that Shopify will hit, on your schedule, from the outside. And if you want to chain calls, for instance fetch a token on one step and reuse it on the next, that is a multi-step pattern. The mechanics of chaining requests and extracting values between steps are covered in our &lt;a href="https://velprove.com/blog/multi-step-api-monitoring-guide" rel="noopener noreferrer"&gt;multi-step API monitoring guide&lt;/a&gt;, and the same engine drives a single receiver probe. The receiver check itself is often a one-step monitor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your app breaks quietly when an API version retires
&lt;/h2&gt;

&lt;p&gt;The receiver is one silent-failure surface. The API version your app targets is the other. Shopify releases a new API version every three months at the start of the quarter, named by date in the &lt;code&gt;YYYY-MM&lt;/code&gt; form, for example &lt;code&gt;2026-04&lt;/code&gt;. Each stable version is supported for a minimum of twelve months, with at least nine months of overlap between consecutive versions, so you get a generous window. The trap is what happens when that window closes.&lt;/p&gt;
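
&lt;p&gt;The naming and support rules make the retirement horizon easy to compute. A minimal Python sketch, assuming each release lands on the first day of January, April, July, or October; the twelve-month figure is the minimum, so the real retirement date can be later.&lt;/p&gt;

```python
from datetime import date

RELEASE_MONTHS = (1, 4, 7, 10)  # one release at the start of each quarter
MIN_SUPPORT_MONTHS = 12         # documented minimum support window


def latest_version_on(day: date) -> str:
    """Newest quarterly version released on or before `day`, in YYYY-MM form.

    Assumes each release lands on the first of its month.
    """
    month = max(m for m in RELEASE_MONTHS if day.month >= m)
    return f"{day.year}-{month:02d}"


def earliest_retirement(version: str) -> str:
    """Earliest YYYY-MM the version can stop being accessible, given the 12-month minimum."""
    year, month = (int(part) for part in version.split("-"))
    total = year * 12 + (month - 1) + MIN_SUPPORT_MONTHS
    return f"{total // 12}-{total % 12 + 1:02d}"
```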

&lt;p&gt;Retirement is not a 404. If your app targets a version that is no longer accessible, Shopify falls forward and responds using the oldest accessible stable version. Your request still succeeds, still returns a &lt;code&gt;200&lt;/code&gt;, and now carries a newer serialization of the data, with fields renamed, removed, or reshaped. This is schema drift, not an endpoint disappearing, and it is exactly the kind of break a status check sails right past.&lt;/p&gt;

&lt;p&gt;Monitor it the same way you would monitor any data contract you depend on. Point a monitor at a representative call into your own app's surface and assert on the response shape, not just the status. Assert that a field you genuinely rely on is present and not null, so the day the serialization changes and that field goes missing, the monitor fails instead of your customers. If part of your surface is GraphQL, remember that a &lt;code&gt;200&lt;/code&gt; can still carry a populated &lt;a href="https://velprove.com/blog/monitor-graphql-api-errors-array" rel="noopener noreferrer"&gt;GraphQL errors array&lt;/a&gt; even when the transport looks fine; that post covers the errors-array assertion in full, so this one will not repeat it.&lt;/p&gt;
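
&lt;p&gt;The same field-present-and-not-null assertion, as a minimal Python sketch you can run against a saved response. The dotted paths and the sample payload are hypothetical; swap in the fields your own code actually reads.&lt;/p&gt;

```python
import json
from typing import Any, List

_MISSING = object()  # sentinel: distinguishes "key absent" from "key present but null"


def get_path(payload: Any, dotted_path: str) -> Any:
    """Walk a dotted path such as 'data.shop.currencyCode' through nested JSON objects."""
    node = payload
    for key in dotted_path.split("."):
        if isinstance(node, dict) and key in node:
            node = node[key]
        else:
            return _MISSING
    return node


def shape_problems(body: str, required_paths: List[str]) -> List[str]:
    """Empty list when every required field is present and not null; otherwise what broke."""
    try:
        payload = json.loads(body)
    except ValueError:
        return ["body is not JSON"]
    problems = []
    for path in required_paths:
        value = get_path(payload, path)
        if value is _MISSING:
            problems.append(f"missing: {path}")
        elif value is None:
            problems.append(f"null: {path}")
    return problems
```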

&lt;h2&gt;
  
  
  The honest limit: why a browser-login monitor can't watch your install flow
&lt;/h2&gt;

&lt;p&gt;There is a real limit worth stating plainly, because it shapes which tool you reach for. You cannot point a synthetic browser-login monitor at your app's install or OAuth flow and have it complete the install.&lt;/p&gt;

&lt;p&gt;Completing a Shopify app install requires an interactive consent screen that Shopify renders inside a specific shop's admin, where an authenticated merchant deliberately clicks to grant your app the access scopes it asked for. That screen is per-shop and merchant-authenticated. There is no static, replayable login that a monitor owns, because the thing being authenticated is a particular merchant's session in a particular store, not a generic test account on your app's own domain. A browser monitor that fills a username and password into a login form cannot manufacture that consent. This is the same class of limit that applies to &lt;a href="https://velprove.com/blog/monitor-login-that-isnt-email-password" rel="noopener noreferrer"&gt;OAuth and SSO logins that are not a plain email and password&lt;/a&gt;: when the credential is a delegated, interactive grant, a form-fill monitor has nothing to fill.&lt;/p&gt;

&lt;p&gt;It helps to know how modern embedded apps actually authenticate, so you scope the limit correctly. An embedded admin app authenticates the merchant's in-admin session with App Bridge and short-lived session tokens, which are JWTs with a one-minute lifetime, and exchanges those for API access tokens via token exchange. Shopify now recommends managed installation plus token exchange for embedded apps, so they no longer need to implement the authorization code grant for installation. That grant is not deprecated or removed across the platform, though; standalone apps still use it. The takeaway is narrow and important: the install and OAuth consent step is what a browser monitor cannot script. It does not mean you cannot monitor anything about your Shopify app. Your webhook receiver, your API surface, and much of your app's own backend are all monitorable with API and multi-step monitors.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Velprove can and cannot assert for your Shopify app
&lt;/h2&gt;

&lt;p&gt;Keeping the boundary explicit saves you from building a monitor on a wrong assumption. Here is the honest split.&lt;/p&gt;

&lt;p&gt;Velprove &lt;strong&gt;can&lt;/strong&gt; probe your webhook receiver and your app API endpoints from five regions. It can assert on the status code, on the response time against a threshold, on the response body content with present-or-absent string checks, and on the response headers your endpoint returns. It can chain multiple API calls in a multi-step monitor and extract a value from one step to reuse in the next. It can run a browser-login monitor against your app's own login surface if your app exposes a plain username-and-password login form of its own, with one success indicator.&lt;/p&gt;

&lt;p&gt;Velprove &lt;strong&gt;cannot&lt;/strong&gt; receive an inbound webhook, because there is no passive receiver on Velprove's side waiting for Shopify to POST to it. It cannot compute or send a valid Shopify HMAC-signed payload as a signing step, because it asserts on returned values rather than acting as a request-signing engine. It cannot complete the interactive, per-shop OAuth or install consent. It cannot read a merchant's inbox or click a magic link. And it does not judge the semantic quality of a response; it checks shape, latency, status, headers, and body content, not whether the data is business-correct. If you need the same generic pattern for a non-Shopify dependency, the approach we describe for &lt;a href="https://velprove.com/blog/monitor-payment-gateway-not-stripe" rel="noopener noreferrer"&gt;monitoring a payment gateway that is not Stripe&lt;/a&gt; uses the same probe-and-assert engine from five regions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Set this up in Velprove (no code)
&lt;/h2&gt;

&lt;p&gt;You can build both monitors in the wizard without writing any code. Velprove is free to start and no-code for the setup, and the receiver and version-shape monitors below both fit comfortably inside the free plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the webhook receiver monitor.&lt;/strong&gt; Create an API monitor, point it at the health or canary route on your receiver, and add the three assertions from the receiver section: status code equals &lt;code&gt;200&lt;/code&gt;, response time under your tuned threshold, and a body or header marker that proves it is the live handler. Choose multiple regions so a partial or regional reachability problem is caught rather than masked by a single vantage point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the version-shape monitor.&lt;/strong&gt; Create a second monitor against a representative call into your own app's surface and assert that a field you depend on is present and not null. That is the monitor that catches the silent schema drift when an API version retires and Shopify falls forward to a newer serialization.&lt;/p&gt;

&lt;p&gt;Wire alerting to wherever you already triage, and you have flipped the order: when your receiver slows past Shopify's budget, or a delivery path breaks, or a field you depend on quietly goes missing, you hear about it from your monitor instead of from the merchant who built their day around your app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Start monitoring your Shopify app with Velprove&lt;/a&gt; . Free, no-code setup, five regions, commercial use allowed. Probe the endpoints Shopify will hit and the fields your app depends on, and find out before your merchants do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How long does my Shopify app have to respond to a webhook?
&lt;/h3&gt;

&lt;p&gt;Shopify allows a one-second connection timeout and a five-second timeout for the entire request. Only a &lt;code&gt;200&lt;/code&gt; response counts as an acknowledgement. Anything outside the 200 range, including &lt;code&gt;3xx&lt;/code&gt; redirects, is treated as a failure. If your receiver answers slowly, redirects, or returns a non-200 status, Shopify considers the delivery failed and enters its retry path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Shopify really delete my webhook subscription after failures?
&lt;/h3&gt;

&lt;p&gt;Only in one specific case. If Shopify receives no response or an error, it retries 8 times over the next 4 hours. After 8 consecutive failures, the subscription is automatically deleted only if it was configured using the Admin API, meaning a shop-specific subscription. App-specific subscriptions declared in your app configuration are not deleted by Shopify. Either way, deliveries are dropping during the failure window, which is the problem a monitor catches first.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can Velprove verify my Shopify HMAC signature for me?
&lt;/h3&gt;

&lt;p&gt;No. Velprove does not compute or send a Shopify HMAC-signed payload, and it is not a request-signing engine. What Velprove can do is probe the endpoint Shopify will hit and assert that it returns &lt;code&gt;200&lt;/code&gt; within the five-second budget, plus assert on the response headers and body that the endpoint returns. Your own handler is responsible for verifying the &lt;code&gt;X-Shopify-Hmac-Sha256&lt;/code&gt; header over the raw request body with your app client secret.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why can't a browser monitor sign into my app's Shopify install flow?
&lt;/h3&gt;

&lt;p&gt;The install and OAuth consent step is an interactive, per-shop, merchant-authenticated screen that Shopify renders inside that store's admin. A synthetic browser monitor with a static test login cannot script it, because there is no replayable login that the monitor owns and the consent is a deliberate human action scoped to one shop. No synthetic monitor completes that consent step; what you can cover is everything around it, your webhook receiver and your app's API surface, with API and webhook-endpoint monitors rather than a browser-login monitor.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens to my app when a Shopify API version retires?
&lt;/h3&gt;

&lt;p&gt;Shopify does not return a 404 for a retired version. It falls forward and responds using the oldest accessible stable version, so your code receives a newer serialization with changed or removed fields. The break is silent schema drift, not an endpoint disappearance. A monitor that asserts on the response shape, for example that a field you depend on is present and not null, catches the day the drift starts returning something your code did not expect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I still need to monitor the mandatory compliance webhooks?
&lt;/h3&gt;

&lt;p&gt;Yes. Every app distributed through the Shopify App Store must respond to the mandatory compliance webhooks, &lt;code&gt;customers/data_request&lt;/code&gt;, &lt;code&gt;customers/redact&lt;/code&gt;, and &lt;code&gt;shop/redact&lt;/code&gt;, even if the app stores no customer data. They should acknowledge receipt with a 200-series status and return a &lt;code&gt;401&lt;/code&gt; on an invalid HMAC. Those receiver routes are endpoints you must keep alive and answering, so they belong in your monitoring alongside your business webhooks.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>webdev</category>
      <category>devops</category>
      <category>uptime</category>
    </item>
    <item>
      <title>Monitor a Self-Hosted App Stack You Run Yourself</title>
      <dc:creator>velprove</dc:creator>
      <pubDate>Thu, 04 Jun 2026 14:00:05 +0000</pubDate>
      <link>https://dev.to/velprove/monitor-a-self-hosted-app-stack-you-run-yourself-4hf6</link>
      <guid>https://dev.to/velprove/monitor-a-self-hosted-app-stack-you-run-yourself-4hf6</guid>
      <description>&lt;p&gt;&lt;strong&gt;Here is the structural problem:&lt;/strong&gt; a monitor that lives on, or right next to, the thing it watches cannot, by construction, see that thing become unreachable from the public internet. When the box goes down it takes the monitor with it. When the home connection drops, both go quiet together. To know whether real users can actually reach your self-hosted app, you need a monitor that runs somewhere else, off your box and off your network. This post is about adding that one external layer on top of the internal tooling you already run, not ripping anything out. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Start for free.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At around 00:40 local time on March 10, 2021, a fire broke out in a power-supply room at OVHcloud's Strasbourg campus. As a precaution, electricity was cut to the entire site, which took SBG1 through SBG4 offline at once, and SBG2, a building holding roughly 14,046 servers, was destroyed. OVH told customers in plain language to activate their disaster recovery plan (&lt;a href="https://www.theregister.com/2021/03/10/ovh_strasbourg_fire/" rel="noopener noreferrer"&gt;The Register, March 10 2021&lt;/a&gt;; see also &lt;a href="https://corporate.ovhcloud.com/en/newsroom/news/informations-site-strasbourg/" rel="noopener noreferrer"&gt;the OVHcloud newsroom&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Now run the homelab version of that event in your head. Your app is on a box in the closet, or on a single VPS. Your monitor, the thing whose whole job is to tell you the app went down, is a container on the same box, or another VPS in the same datacenter. The power goes. The uplink goes. The site goes. And your monitor goes dark right alongside the thing it was supposed to be watching, so the alert you were counting on never fires. It is not that the monitor failed. It is that it could never have succeeded. A probe that shares fate with the app cannot report the app unreachable, because the same event that makes the app unreachable also silences the probe.&lt;/p&gt;

&lt;h2&gt;
  
  
  The blind spot is not Kuma. It is anything that shares fate with your box.
&lt;/h2&gt;

&lt;p&gt;It is tempting to make this a story about one tool, but the problem is structural, not a product flaw. A self-hosted Uptime Kuma instance has the blind spot. So does a Prometheus blackbox exporter scraping your app from the same network, a Netdata agent on the host, a curl-cron in a tmux session, a self-hosted Healthchecks.io receiving dead-man pings, and the app's own &lt;code&gt;/health&lt;/code&gt; endpoint. Every one of them is excellent at what it does. Every one of them sits on or beside the box and depends on that box, its network, and its power to function. The moment the failure is "the whole thing is unreachable from outside," the on-box observer is part of what went dark.&lt;/p&gt;

&lt;p&gt;This is not an argument against self-hosting your monitoring. It is an argument about where one specific question has to be answered from. If you want the tool-by-tool comparison of self-hosted Uptime Kuma against a hosted service, including the real cost of running it yourself, that is a separate post: &lt;a href="https://velprove.com/blog/uptime-kuma-vs-hosted-monitoring" rel="noopener noreferrer"&gt;Uptime Kuma vs hosted monitoring&lt;/a&gt;. Here, the only claim is narrower: the observer that has to confirm public reachability cannot be one that shares fate with the thing it watches.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five things external monitoring sees that your box can't
&lt;/h2&gt;

&lt;p&gt;Here are the failure shapes where a same-host probe shows green, or shows nothing at all, while a real user on the public internet gets nothing back.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Host down
&lt;/h3&gt;

&lt;p&gt;A kernel panic, an out-of-memory condition that takes the whole box down, a full disk that wedges every write, or a power loss. The host is gone, and so is any monitor running on it. An external probe simply stops getting answers, and it alerts.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Outbound or network partition
&lt;/h3&gt;

&lt;p&gt;The box is up and the app is happily serving on localhost, but the uplink is down or a firewall rule has cut the path to the internet. An on-box check passes because the app answers locally. Nobody outside can reach it. Only a probe coming from outside notices the partition.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Regional DNS or CDN failure
&lt;/h3&gt;

&lt;p&gt;Your app is fine, but the DNS record fails to resolve in some resolvers, or the CDN or proxy in front of it returns errors in one region. A single-location check sees whichever path happens to work. A multi-region external monitor catches the market where resolution or the edge is broken.&lt;/p&gt;
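&lt;p&gt;Turning per-region results into a verdict is what separates a partial outage from a full one. Below is a minimal sketch. It assumes you collect one pass/fail boolean per region from monitors running in different places, and the region names are placeholders.&lt;/p&gt;

```python
def regional_verdict(results: dict) -> str:
    """Summarize pass/fail results keyed by region name, e.g. {"eu-central": True}."""
    if not results:
        raise ValueError("no regional results to compare")
    failing = sorted(region for region, ok in results.items() if not ok)
    if not failing:
        return "up everywhere"
    if len(failing) == len(results):
        return "down everywhere"
    # Some regions pass and some fail: the case a single location cannot see.
    return "partial outage: " + ", ".join(failing)


print(regional_verdict({"us-east": True, "eu-central": False, "ap-south": True}))
# prints partial outage: eu-central
```

&lt;p&gt;A single-location check can only ever produce the first or second answer. The third is what a one-region DNS or edge failure looks like.&lt;/p&gt;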

&lt;h3&gt;
  
  
  4. Whole-datacenter outage
&lt;/h3&gt;

&lt;p&gt;The OVH-class event, or a Hetzner-class one. Power, cooling, or networking for the entire facility goes, and every box you have there, including a co-located monitor, goes with it. A monitor in a different region is the only thing left standing to tell you the site is dark.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The box's own internet is broken
&lt;/h3&gt;

&lt;p&gt;The homelab special. Your home ISP drops, the router reboots itself after a firmware update, the dynamic IP rotates and your DNS has not caught up, or the NAT mapping silently expires. The box is on, the app is running, the LAN is happy, and the public internet sees nothing. An external monitor is the only observer that is not on the wrong side of your own front door.&lt;/p&gt;

&lt;p&gt;Notice the through-line: in each case a probe on the box can be green while real users get nothing. That gap between a green internal signal and a broken public experience is the whole subject of &lt;a href="https://velprove.com/blog/why-uptime-monitors-miss-outages" rel="noopener noreferrer"&gt;why uptime monitors miss real outages&lt;/a&gt;, and the datacenter case is also why you want an independent record of your provider's downtime when you go to &lt;a href="https://velprove.com/blog/verify-hosting-provider-uptime" rel="noopener noreferrer"&gt;verify your host's uptime claims&lt;/a&gt; against an SLA.&lt;/p&gt;
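&lt;p&gt;From the outside, these failure shapes mostly surface as a small set of distinct errors on an ordinary fetch: a name that does not resolve, a connection that is refused or times out, a TLS failure, or an error status. Below is a minimal probe sketch using only the Python standard library. The labels are our own and are not any monitoring product's.&lt;/p&gt;

```python
import socket
import ssl
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def classify_probe_error(exc: BaseException) -> str:
    """Map a failed fetch to a coarse, illustrative label."""
    if isinstance(exc, HTTPError):  # subclass of URLError, so check it first
        return f"http_{exc.code}"
    if isinstance(exc, URLError) and isinstance(exc.reason, BaseException):
        exc = exc.reason  # urlopen wraps the underlying socket or TLS error
    if isinstance(exc, socket.gaierror):
        return "dns"  # regional DNS failure, or a stale dynamic-IP record
    if isinstance(exc, ssl.SSLError):
        return "tls"
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout"  # host down, partition, dead uplink, dark datacenter
    if isinstance(exc, ConnectionRefusedError):
        return "refused"  # something answered, but nothing listens on that port
    if isinstance(exc, OSError):
        return "network"
    return "unknown"


def probe(url: str, timeout: float = 10.0) -> str:
    """Fetch url once and return "ok" or a failure label."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            resp.read()
            return "ok"
    except Exception as exc:  # classify anything the fetch raised
        return classify_probe_error(exc)
```

&lt;p&gt;Run from a box on your own network, this probe inherits every blind spot above. The same code only becomes useful for public reachability when it runs somewhere else.&lt;/p&gt;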

&lt;h2&gt;
  
  
  A spare Pi is not the fix
&lt;/h2&gt;

&lt;p&gt;The obvious homelab reflex is to add a second box. Put a second Raspberry Pi on the shelf, run the monitor there, and now if the first box dies the second one notices. That is genuinely useful for host-level failures, and you should not feel bad for doing it. But be honest about what it does and does not cover.&lt;/p&gt;

&lt;p&gt;A second local box shares almost everything with the first one: the same home internet connection, the same router, the same power strip, the same ISP, the same public IP, the same physical building. If you run two VPS instances in the same datacenter, they share that datacenter's power and network. When the shared thing fails, both boxes fail together, and your second monitor confirms only that the app still answers on the LAN. It cannot confirm that anyone on the public internet can reach you, because it is sitting on the same broken side of the connection. The only observer that can confirm public reachability is one that runs off your network entirely.&lt;/p&gt;

&lt;p&gt;So the framing is not external instead of internal. It is external on top of internal. Keep the spare box if you like the host-level redundancy. Add one external probe for the question neither box can answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  "But my box is behind NAT with no open ports. How would an external monitor even reach it?"
&lt;/h2&gt;

&lt;p&gt;This is the most common objection from people running a homelab, and it is a fair one. If you have not forwarded a port, there is no public address for an outside monitor to hit. You do not have to open one. The standard answer is an outbound-only tunnel, and the most common is Cloudflare Tunnel via the &lt;code&gt;cloudflared&lt;/code&gt; daemon. It runs on your box, dials out to Cloudflare, and Cloudflare publishes a public hostname that routes back down that connection to your local service. No inbound port is opened. Your router stays closed. Your app gets a real, public URL.&lt;/p&gt;

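&lt;p&gt;On your side, that public hostname is one ingress rule in the &lt;code&gt;cloudflared&lt;/code&gt; config, mapping &lt;code&gt;app.example.com&lt;/code&gt; to the service listening on &lt;code&gt;localhost:8080&lt;/code&gt;.&lt;/p&gt;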
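&lt;p&gt;Below is a minimal sketch of such a config. It assumes a named tunnel already exists, and the tunnel ID and credentials path are placeholders. &lt;code&gt;cloudflared&lt;/code&gt; expects the final ingress rule to be a catch-all with no hostname.&lt;/p&gt;

```yaml
# ~/.cloudflared/config.yml (tunnel ID and credentials path are placeholders)
tunnel: YOUR-TUNNEL-ID
credentials-file: /home/youruser/.cloudflared/YOUR-TUNNEL-ID.json

ingress:
  # Public hostname -> local service on this box
  - hostname: app.example.com
    service: http://localhost:8080
  # Catch-all rule: any other hostname gets a 404
  - service: http_status:404
```

&lt;p&gt;You would typically also point the hostname at the tunnel with &lt;code&gt;cloudflared tunnel route dns&lt;/code&gt; and start it with &lt;code&gt;cloudflared tunnel run&lt;/code&gt;. Check Cloudflare's tunnel documentation for the exact steps on your version.&lt;/p&gt;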

&lt;p&gt;With that running, &lt;code&gt;app.example.com&lt;/code&gt; is a public URL backed by your local &lt;code&gt;localhost:8080&lt;/code&gt;, and an external monitor can fetch it exactly the way a visitor would, traversing the same tunnel, proxy, and TLS a real request takes. If your app is genuinely private-only and you never want it on the public internet, do not force it. Monitor the public edge you do expose, a reverse proxy, a status endpoint, or a VPN gateway, and keep an internal monitor for the private services behind it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitor the login, not just the landing page
&lt;/h2&gt;

&lt;p&gt;A landing page returning 200 is the weakest possible proof that your self-hosted app works. The part your users actually depend on is the login, and a login can break while the homepage stays perfectly green. A same-host &lt;code&gt;/health&lt;/code&gt; probe never sees it, because the process is up and answering; it is the sign-in flow that is broken.&lt;/p&gt;

&lt;p&gt;This is where Velprove's differentiator sits. The free, no-code browser login monitor opens a real browser, navigates to your self-hosted Nextcloud, Gitea, Immich, or Home Assistant login, fills in a dedicated test user's credentials, and asserts that a string on the post-login page actually rendered. It signs into your own login the way a person would, catching the broken-login-behind-a-200 case that a status check sails straight past. There is no Playwright script to write; you fill in the URL, the credentials, and the success string in a wizard.&lt;/p&gt;

&lt;p&gt;Pair that with the free multi-step API monitor when a single request is not enough. On the free plan you get up to 3 steps, which is plenty to chain a token request into an authenticated read and assert on what comes back. Every Velprove monitor, including the free ones, runs from one region you pick out of 5, so the regional and datacenter cases above are covered by where each check runs, not just by what it checks. The multi-step mechanics are walked through in the &lt;a href="https://velprove.com/blog/multi-step-api-monitoring-guide" rel="noopener noreferrer"&gt;multi-step API monitoring guide&lt;/a&gt;.&lt;/p&gt;
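&lt;p&gt;The token-then-read chain is easy to sketch as three explicit steps that stop at the first failure. This is a hypothetical illustration, not how Velprove runs its steps. The &lt;code&gt;/api/token&lt;/code&gt; and &lt;code&gt;/api/me&lt;/code&gt; paths, the JSON field names, and the injected &lt;code&gt;fetch&lt;/code&gt; callable are all assumptions, and the injected callable lets you exercise the chain without a network.&lt;/p&gt;

```python
import json
from typing import Callable, Optional, Tuple
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# fetch(method, url, headers, body) -> (status, response bytes)
Fetch = Callable[[str, str, dict, Optional[bytes]], Tuple[int, bytes]]


def urllib_fetch(method: str, url: str, headers: dict, body: Optional[bytes]) -> Tuple[int, bytes]:
    """Real-network fetch; error statuses are returned instead of raised."""
    req = Request(url, data=body, headers=headers, method=method)
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status, resp.read()
    except HTTPError as exc:
        return exc.code, exc.read()


def run_token_then_read(fetch: Fetch, base_url: str, client_id: str,
                        client_secret: str, expected_username: str) -> list:
    """Three hypothetical steps; returns failure messages (empty list means pass)."""
    # Step 1: exchange credentials for a token.
    creds = json.dumps({"client_id": client_id, "client_secret": client_secret}).encode()
    status, raw = fetch("POST", base_url + "/api/token",
                        {"Content-Type": "application/json"}, creds)
    if status != 200:
        return [f"step 1: token request returned {status}"]
    token = json.loads(raw).get("access_token")
    if not token:
        return ["step 1: response had no access_token"]
    # Step 2: authenticated read with the token from step 1.
    status, raw = fetch("GET", base_url + "/api/me", {"Authorization": "Bearer " + token}, None)
    if status != 200:
        return [f"step 2: authenticated read returned {status}"]
    # Step 3: assert on what came back, not just on the status code.
    username = json.loads(raw).get("username")
    if username != expected_username:
        return [f"step 3: username was {username!r}, expected {expected_username!r}"]
    return []
```

&lt;p&gt;In production you would pass &lt;code&gt;urllib_fetch&lt;/code&gt;. In a test you pass a stub that returns canned responses, which is why the transport is a parameter rather than hard-coded.&lt;/p&gt;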

&lt;h2&gt;
  
  
  Set it up against your own stack in a few minutes
&lt;/h2&gt;

&lt;p&gt;The concrete setup, in order, against your self-hosted stack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. HTTP monitor on the public URL.&lt;/strong&gt; Create an HTTP monitor and point it at your public URL or tunnel hostname. On the verification step, add two Success Conditions: a status code of 200, and a Response Body Contains assertion on a string your correct page always renders. Pick something load-bearing, a known piece of copy or a marker your template emits, so a blank page or a wrong upstream fails the check even when the status code is 200.&lt;/p&gt;
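&lt;p&gt;The two Success Conditions amount to a status check plus a substring check, and both have to pass. Below is a minimal sketch of that logic, assuming a plain GET with no login. The function names and the example marker string are hypothetical, and this is not how Velprove evaluates its conditions internally.&lt;/p&gt;

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def evaluate(status: int, body: str, must_contain: str) -> list:
    """Apply both success conditions; an empty list means the check passes."""
    failures = []
    if status != 200:
        failures.append(f"status {status}, expected 200")
    if must_contain not in body:
        # A blank page or the wrong upstream can still answer 200.
        failures.append(f"body does not contain {must_contain!r}")
    return failures


def check_page(url: str, must_contain: str, timeout: float = 10.0) -> list:
    """Fetch url and evaluate it; error statuses are evaluated, not raised."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return evaluate(resp.status, resp.read().decode("utf-8", "replace"), must_contain)
    except HTTPError as exc:
        return evaluate(exc.code, exc.read().decode("utf-8", "replace"), must_contain)
```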

&lt;p&gt;&lt;strong&gt;2. Browser login monitor on the login.&lt;/strong&gt; Add a browser login monitor on your self-hosted login URL, fill in the dedicated test user's credentials, and set Success verification to a string that only appears once sign-in succeeds. This is the layer that catches a login regression a homepage check never sees.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Optional multi-step API monitor.&lt;/strong&gt; If your app has an API worth verifying end to end, chain a multi-step monitor: request a token, call an authenticated endpoint, assert on the response. Three steps are available on the free plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Pick regions and wire alerts.&lt;/strong&gt; Choose the region closest to your real users for the primary check, and use a second region on another monitor to catch regional DNS or CDN divergence. Add your email for alerts.&lt;/p&gt;

&lt;p&gt;On your own network, a simple &lt;code&gt;/health&lt;/code&gt; route and a curl-cron for internal checks are fine to keep. A bare reachability one-liner like &lt;code&gt;curl -fsS https://app.example.com/health || alert&lt;/code&gt;, run from a box on your network, tells you the app answers locally. That is useful, and it is exactly the LAN-only confirmation the external monitor exists to go beyond.&lt;/p&gt;

&lt;p&gt;This is the same external-correctness pattern indie hackers use on side projects that happen to run on managed hosts instead of a closet box, which is laid out in the &lt;a href="https://velprove.com/blog/uptime-monitoring-indie-hackers-side-projects" rel="noopener noreferrer"&gt;indie hacker free monitoring stack&lt;/a&gt;. The surface differs; the principle does not.&lt;/p&gt;

&lt;h2&gt;
  
  
  External and internal together: the layered setup
&lt;/h2&gt;

&lt;p&gt;The end state is not one tool. It is two layers that do not overlap. Keep Prometheus, Netdata, or a local Uptime Kuma for what they are good at: resource graphs, container and disk health, internal-only services that never face the public internet, and notification routing inside your network. Those tools see deep into the box, and an external monitor never will.&lt;/p&gt;

&lt;p&gt;Then add the external multi-region synthetic for the one thing your internal stack physically cannot do: confirm that the public can reach you. It is the observer that is not on your box, not on your home internet, and not in your datacenter, so it is still up to notice when all of those go dark. For the full cost-and-feature comparison of running Kuma yourself versus a hosted service, the dedicated post is &lt;a href="https://velprove.com/blog/uptime-kuma-vs-hosted-monitoring" rel="noopener noreferrer"&gt;Uptime Kuma vs hosted monitoring&lt;/a&gt;. The split this post argues for is simple: internal tools for what is happening inside, one external monitor for whether anyone outside can get in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do I monitor a self-hosted app from outside my own network?
&lt;/h3&gt;

&lt;p&gt;You point a monitor that runs somewhere other than your network at the public address your app answers on. That public address is either a port you forward through your router, a reverse proxy or VPS that fronts the app, or an outbound tunnel like Cloudflare Tunnel that gives the app a public hostname without opening any inbound ports. The monitor then fetches that URL from one or more regions on a schedule and alerts you when the fetch fails or the page no longer contains the content it should. The point is that the monitor lives off your box and off your home internet, so when your box, your uplink, or your whole site goes dark, the external monitor is still up to notice. With Velprove you create an HTTP monitor on the public URL, add a body assertion on a known string, choose your regions, and wire up email alerts. It is free, with no credit card.&lt;/p&gt;

&lt;h3&gt;
  
  
  My self-hosted app is behind NAT with no open ports. Can it still be monitored externally?
&lt;/h3&gt;

&lt;p&gt;Yes, if you give it a public hostname. The standard way to do this without opening any inbound ports is Cloudflare Tunnel: you run the &lt;code&gt;cloudflared&lt;/code&gt; daemon on your box, it makes an outbound-only connection to Cloudflare, and Cloudflare publishes a public hostname that routes back down that tunnel to your local service. Nothing is exposed on your router, no port is forwarded, yet your app now has a real public URL. You point an external monitor at that hostname and you are monitoring the exact path a real visitor takes. If your app is genuinely private and you never want it on the public internet, then monitor the public edge you do expose, such as a reverse proxy, a status endpoint, or a VPN gateway, and keep an internal monitor for the private services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can a free monitor sign into my self-hosted Nextcloud or Gitea login?
&lt;/h3&gt;

&lt;p&gt;Yes. Velprove's browser login monitor opens a real browser, navigates to your login page, fills in a dedicated test user's email and password, and asserts that a string on the post-login page actually rendered. That works against a self-hosted Nextcloud, Gitea, Immich, or Home Assistant login the same way it works against any hosted SaaS, because it drives the same login form a person would. It is no-code: you fill in the URL, the credentials, and the success string in a wizard, with no Playwright script to write. The free plan includes one browser login monitor at a 15-minute interval. Use a low-privilege test account, never your admin credentials, so the monitor can do nothing more than confirm the login works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Will external monitoring replace Prometheus, Netdata, or Uptime Kuma on my box?
&lt;/h3&gt;

&lt;p&gt;No, and you should not try to make it. The two layers answer different questions. Prometheus, Netdata, and a local Uptime Kuma instance see what is happening inside your box and your network: CPU, memory, disk, container health, internal services, and they route notifications. They are very good at that. What they cannot do, by construction, is tell you whether the public internet can still reach you, because they share fate with the box and the network they live on. An external multi-region monitor answers only that one question and answers it well. Keep your internal tooling for resources and internal services, and add the external monitor for public reachability. For the tool-specific cost and feature comparison, see our &lt;a href="https://velprove.com/blog/uptime-kuma-vs-hosted-monitoring" rel="noopener noreferrer"&gt;Uptime Kuma vs hosted monitoring&lt;/a&gt; post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is a second Raspberry Pi enough redundancy for monitoring my homelab?
&lt;/h3&gt;

&lt;p&gt;It depends on what failure you want to catch. A second Pi on the same shelf catches the first box crashing, and that is worth something. What it cannot catch is everything the two boxes share: the same home internet connection, the same router, the same power circuit, the same ISP, the same public IP. When your ISP drops, your router reboots, or the power blinks, both Pis go dark together, and the spare confirms only that the app still answers on your LAN, not that anyone on the public internet can reach it. A second local box improves redundancy for host-level failures and does nothing for the failures that share fate with your whole site. Only a monitor that runs off your network can confirm public reachability.&lt;/p&gt;

&lt;h3&gt;
  
  
  What can an external monitor catch that my app's own &lt;code&gt;/health&lt;/code&gt; endpoint can't?
&lt;/h3&gt;

&lt;p&gt;Everything that happens between your app and a real visitor, plus everything that takes the app itself offline. Your app's &lt;code&gt;/health&lt;/code&gt; endpoint runs inside the app process. It can tell you the app thinks it is healthy, but it cannot answer a visitor's request once the process has crashed, the host has panicked, the disk is full, the uplink is down, or the datacenter has lost power, because it needs both the app and the path to it to be working. An external monitor sees the failure precisely because it is asking from outside: the fetch times out or errors, and it alerts. It also catches a reverse proxy serving the wrong upstream, an expired certificate, a DNS or CDN failure in front of the app, and a login that returns 200 but no longer works. For what a &lt;code&gt;/health&lt;/code&gt; endpoint should and should not contain, see our &lt;a href="https://velprove.com/blog/monitor-rest-api-health-endpoint" rel="noopener noreferrer"&gt;REST API health endpoint guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitor your self-hosted stack free
&lt;/h2&gt;

&lt;p&gt;The Velprove free plan covers 10 monitors, one no-code browser login monitor that signs into your own self-hosted app, multi-step API monitors up to 3 steps, and 5 regions to choose from, one per monitor. Commercial use is allowed and no credit card is required. Keep your Prometheus, Netdata, or Uptime Kuma for the inside of the box, and add the one external layer that confirms the public can still reach you. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Start with the free plan&lt;/a&gt;. The first monitor takes a few minutes to set up.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>webdev</category>
      <category>devops</category>
      <category>uptime</category>
    </item>
    <item>
      <title>The Road to 47-Day Certificates: When Renewal Automation Fails Silently</title>
      <dc:creator>velprove</dc:creator>
      <pubDate>Wed, 03 Jun 2026 14:00:04 +0000</pubDate>
      <link>https://dev.to/velprove/the-road-to-47-day-certificates-when-renewal-automation-fails-silently-363k</link>
      <guid>https://dev.to/velprove/the-road-to-47-day-certificates-when-renewal-automation-fails-silently-363k</guid>
      <description>&lt;p&gt;&lt;strong&gt;The bottom line:&lt;/strong&gt; a certificate that auto-renews on schedule can still take your site down, because the cert your server holds on disk is not always the cert it is serving in memory, the chain it sends can break even while the leaf is valid, and a green days-remaining badge cannot see either of those. Maximum public TLS certificate lifetimes are shrinking on a fixed schedule, from 398 days down to 200 as of March 15 2026, then 100 in 2027, then 47 in 2029, which multiplies renewal events and with them the chances for a silent failure. To catch these failures you need a check that talks to the live socket and confirms a real client can complete the handshake and load the page, not just a number that counts down the leaf certificate. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Start for free.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The road to 47 days (and why it changes the math)
&lt;/h2&gt;

&lt;p&gt;For years the practical answer to "how long does a certificate last" was about a year, and renewal was a once-a-year chore you could half forget about. That era is ending on a published schedule. The maximum lifetime of a public TLS certificate is stepping down, and each step means you renew more often, which means the renewal machinery runs more often, which means it has more opportunities to fail quietly.&lt;/p&gt;

&lt;h3&gt;
  
  
  What actually got decided, and when
&lt;/h3&gt;

&lt;p&gt;In April 2025 the CA/Browser Forum, the body that sets the baseline rules every public certificate authority and major browser follows, adopted Ballot SC-081v3. It was proposed by Apple's Clint Wilson and endorsed by Sectigo, Google Chrome, and Mozilla (&lt;a href="https://cabforum.org/2025/04/11/ballot-sc081v3-introduce-schedule-of-reducing-validity-and-data-reuse-periods/" rel="noopener noreferrer"&gt;CA/Browser Forum, Ballot SC-081v3&lt;/a&gt;, verified 2026-06-03). It sets a schedule that ratchets the maximum TLS certificate validity down in steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;398 days&lt;/strong&gt;, the prior maximum, before the schedule begins. &lt;strong&gt;200 days&lt;/strong&gt; starting March 15 2026. &lt;strong&gt;100 days&lt;/strong&gt; starting March 15 2027. &lt;strong&gt;47 days&lt;/strong&gt; starting March 15 2029.&lt;/p&gt;
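&lt;p&gt;Expressed as a lookup, the schedule is a short table keyed on the date a certificate is issued. Below is a minimal sketch. It assumes the ceiling applies by issuance date and encodes only the four values listed above, so the ballot itself remains the authority on edge cases.&lt;/p&gt;

```python
from datetime import date

# (effective date, maximum validity in days), newest first, per the schedule above.
SC081_SCHEDULE = [
    (date(2029, 3, 15), 47),
    (date(2027, 3, 15), 100),
    (date(2026, 3, 15), 200),
]
PRIOR_MAX_DAYS = 398


def max_validity_days(issued: date) -> int:
    """Maximum allowed lifetime for a public TLS certificate issued on `issued`."""
    for effective, days in SC081_SCHEDULE:
        if issued >= effective:
            return days
    return PRIOR_MAX_DAYS


print(max_validity_days(date(2026, 6, 3)))  # prints 200
```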

&lt;p&gt;Domain-validation data-reuse windows shrink in lockstep over the same timeline, trending toward roughly ten days for the validation tied to the names on a certificate. The exact day-counts at each intermediate step are detailed in the ballot itself, so treat the precise numbers as a moving target and the direction as the certainty: re-validation has to happen far more often, so the parts of issuance you used to set and forget now run on a much tighter loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  More renewals = more chances to fail silently
&lt;/h3&gt;

&lt;p&gt;The lifetime number is not the risk. The risk is the renewal event, and shorter lifetimes give you more of them. A certificate that renews once a year exercises your renewal automation, your reload hook, and your chain configuration once a year. Drop to 47 days and the same machinery fires roughly eight times a year. Every one of those runs is a chance for the cron to be stopped, for a deploy hook to be missing, for a reload to be skipped, or for the served chain to drift. The failure modes are not new. Shorter lifetimes just hand you many more rolls of the same dice, and a once-a-year problem becomes a several-times-a-year problem.&lt;/p&gt;
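&lt;p&gt;The "roughly eight times a year" figure is just 365 divided by the lifetime. It climbs further if your client renews ahead of expiry. Below is a back-of-the-envelope sketch. The 15-day renewal margin in the example loop is an illustrative assumption, not a figure from the ballot.&lt;/p&gt;

```python
def renewals_per_year(lifetime_days: int, renew_days_before_expiry: int = 0) -> float:
    """How many times per year the renewal pipeline fires for a given lifetime."""
    if renew_days_before_expiry >= lifetime_days:
        raise ValueError("renewal margin must be shorter than the lifetime")
    # Each certificate is replaced after (lifetime - margin) days.
    return 365 / (lifetime_days - renew_days_before_expiry)


for lifetime in (398, 200, 100, 47):
    # Renewals per year at expiry, and with a hypothetical 15-day margin.
    print(lifetime, round(renewals_per_year(lifetime), 1), round(renewals_per_year(lifetime, 15), 1))
```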

&lt;p&gt;That is the whole argument for watching renewals as a recurring event rather than treating a fresh certificate as a job done. Set-and-forget was always optimistic. On the road to 47 days it stops being viable at all.&lt;/p&gt;

&lt;h3&gt;
  
  
  A separate timeline you should not conflate
&lt;/h3&gt;

&lt;p&gt;One thing to keep straight: Let's Encrypt is also shortening its certificates, but that is a separate decision from the CA/Browser Forum mandate above. Let's Encrypt has said it intends to cut its own default lifetime from 90 days toward 45 by 2028, and it offers an opt-in short-lived profile of around six days for setups that want it (&lt;a href="https://letsencrypt.org/2025/12/02/from-90-to-45" rel="noopener noreferrer"&gt;Let's Encrypt, from 90 to 45&lt;/a&gt;, verified 2026-06-03). Do not merge the two timelines in your head. The CA/B schedule is the industry-wide ceiling everyone lives under; the Let's Encrypt change is one issuer moving faster than the ceiling requires. Both point the same way, shorter lifetimes and more renewals, which is the only takeaway you need for monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a "days remaining" badge is structurally blind
&lt;/h2&gt;

&lt;p&gt;A days-remaining check reads one field off the leaf certificate: the &lt;code&gt;notAfter&lt;/code&gt; date. It subtracts today from that date and shows you a number. That number is genuinely useful for one job, catching a certificate that is about to lapse, and you should run it. For the expiry-countdown setup and the 30-15-7 threshold rule that governs how to act on that number, see our guide on &lt;a href="https://velprove.com/blog/ssl-certificate-expiry-monitoring" rel="noopener noreferrer"&gt;SSL certificate expiry monitoring&lt;/a&gt;. That post owns the countdown. This one starts where the countdown goes blind.&lt;/p&gt;
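&lt;p&gt;For contrast, here is the entire calculation behind a days-remaining badge, as a minimal sketch using Python's standard &lt;code&gt;ssl&lt;/code&gt; module. The helper name is hypothetical, and the &lt;code&gt;notAfter&lt;/code&gt; string could come from a file on disk or from a live socket. Either way, it is the only input.&lt;/p&gt;

```python
import ssl
import time
from typing import Optional


def days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days until a certificate's notAfter, given OpenSSL's text form,
    e.g. 'Jan  5 09:34:43 2018 GMT'."""
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch, UTC
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


# Only the leaf's date goes in: a stale in-memory cert or a broken chain never
# enters the calculation.
print(days_remaining("Jan 15 09:34:43 2018 GMT", now=1515144883))  # prints 10.0
```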

&lt;p&gt;The badge is structurally blind because both of the failures in this post leave the leaf's &lt;code&gt;notAfter&lt;/code&gt; date completely healthy. A renewal that succeeds on disk but never reloads the server still has a valid leaf; the number is fine. A served chain that breaks because an intermediate is missing or a cross-sign expired still has a valid leaf; the number is fine. In both cases the page is down for real clients while the badge stays green. That is the most dangerous shape an outage can take, a green dashboard over a broken site, and it is exactly the pattern we walk through in &lt;a href="https://velprove.com/blog/anatomy-of-a-silent-outage" rel="noopener noreferrer"&gt;the anatomy of a silent outage&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The fix is not a better expiry number. It is a different question. Instead of asking "how many days until the leaf expires," you ask "can a standard client complete the handshake and load the page right now." The rest of this post is the two ways the answer to that second question goes no while the first question still looks fine.&lt;/p&gt;
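&lt;p&gt;You can ask that second question directly with a client that verifies the chain and the hostname and does nothing forgiving. Below is a minimal sketch with Python's standard &lt;code&gt;ssl&lt;/code&gt; module. To our understanding, it relies on OpenSSL and does not fetch a missing intermediate on its own, so it behaves like a strict automated client. The function name is hypothetical, and loading the page itself is still a separate HTTP check.&lt;/p&gt;

```python
import socket
import ssl
from typing import Tuple


def strict_handshake(host: str, port: int = 443, timeout: float = 10.0) -> Tuple[bool, str]:
    """Try a fully verified TLS handshake; return (ok, detail)."""
    # Default context: verifies the served chain against the system trust store
    # and checks the hostname, with no workaround for an incomplete chain.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, f"handshake ok ({tls.version()})"
    except ssl.SSLCertVerificationError as exc:
        # Missing intermediate, expired link, wrong name: the hard-fail case.
        return False, f"certificate verification failed: {exc.verify_message}"
    except ssl.SSLError as exc:
        return False, f"TLS error: {exc}"
    except OSError as exc:
        return False, f"connection failed: {exc}"
```

&lt;p&gt;The signal to watch for is &lt;code&gt;strict_handshake("example.com")&lt;/code&gt; returning &lt;code&gt;False&lt;/code&gt; with a verification message while your own browser loads the site fine.&lt;/p&gt;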

&lt;h2&gt;
  
  
  Cert on disk is not cert in memory
&lt;/h2&gt;

&lt;p&gt;The first trap is that renewal and serving are two separate acts. Renewal writes new certificate files to disk. Serving reads a certificate that the web server loaded into memory when it last started. Those two can drift apart, and when they do, a check that inspects the file on disk will swear everything is fine while the live socket serves a different cert.&lt;/p&gt;

&lt;h3&gt;
  
  
  The renew-without-reload gap
&lt;/h3&gt;

&lt;p&gt;Here is the exact sequence. Your renewal client, certbot for example, wakes up on its timer, talks to the certificate authority, and writes a fresh certificate and key to disk. The renewal command reports success. The files are correct. The expiry date on disk is now months out. And your web server keeps serving the old certificate, because Nginx, Apache, and most load balancers load the certificate into memory at startup and do not re-read the file just because it changed. Until the process is reloaded, what is on the wire is the previous certificate, counting down toward its real expiry while the disk says all is well (&lt;a href="https://eff-certbot.readthedocs.io/en/stable/using.html" rel="noopener noreferrer"&gt;Certbot documentation on deploy hooks&lt;/a&gt;, verified 2026-06-03).&lt;/p&gt;

&lt;p&gt;The fix is a deploy hook, sometimes called a reload hook, that runs only when a certificate actually changes and reloads the web server as the last step of renewal. Certbot has first-class support for this; you give it a command to run on a successful renewal, and that command reloads the server so memory catches up with disk. Without that hook, renewal and serving stay decoupled, and you have manufactured a window where the files lie about what your visitors are actually getting. The reason this is worth a whole section is that it defeats the most common do-it-yourself check, a script that reads the cert file or asks certbot when things expire. That script reads disk. Your customers hit memory.&lt;/p&gt;
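&lt;p&gt;One way to detect the gap is to compare the leaf your server is actually sending with the leaf on disk. Below is a minimal sketch using Python's standard library. The helper names are hypothetical, and the certbot-style path in the usage note is an assumption about your layout. Verification is deliberately switched off for the fetch, because the goal is to see which certificate is served, not whether it is trusted.&lt;/p&gt;

```python
import hashlib
import socket
import ssl

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"
PEM_END = "-----END CERTIFICATE-----"


def leaf_fingerprint_from_pem(pem_text: str) -> str:
    """SHA-256 of the first certificate in a PEM file (the leaf in a fullchain file)."""
    start = pem_text.index(PEM_BEGIN)
    end = pem_text.index(PEM_END, start) + len(PEM_END)
    der = ssl.PEM_cert_to_DER_cert(pem_text[start:end])
    return hashlib.sha256(der).hexdigest()


def served_leaf_fingerprint(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """SHA-256 of the leaf certificate the live socket presents right now."""
    context = ssl.create_default_context()
    context.check_hostname = False       # must be disabled before CERT_NONE
    context.verify_mode = ssl.CERT_NONE  # inspect what is served, trusted or not
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return hashlib.sha256(der).hexdigest()


def served_matches_disk(host: str, pem_path: str) -> bool:
    """True when the certificate in memory is the one renewal wrote to disk."""
    with open(pem_path, encoding="ascii") as fh:
        on_disk = leaf_fingerprint_from_pem(fh.read())
    return served_leaf_fingerprint(host) == on_disk
```

&lt;p&gt;If &lt;code&gt;served_matches_disk("example.com", "/etc/letsencrypt/live/example.com/fullchain.pem")&lt;/code&gt; returns &lt;code&gt;False&lt;/code&gt; right after a successful renewal, you are looking at the renew-without-reload gap. The durable fix is still the hook. Certbot's &lt;code&gt;--deploy-hook&lt;/code&gt; option, for example &lt;code&gt;--deploy-hook "systemctl reload nginx"&lt;/code&gt;, runs the reload only when a certificate was actually renewed.&lt;/p&gt;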

&lt;h3&gt;
  
  
  The stale cert cached at the load balancer or CDN
&lt;/h3&gt;

&lt;p&gt;The same disk-versus-memory split happens one layer out, at your load balancer or CDN, and it is sneakier because the box that renews and the box that serves are not even the same machine. A renewal completes on the origin, but the certificate that actually faces the internet lives on a load balancer, a reverse proxy, or a CDN edge that holds its own copy. If the new certificate is not pushed to that layer, or the layer does not reload, the edge serves a stale cert to the world while the origin is perfectly up to date.&lt;/p&gt;

&lt;p&gt;This gets worse with a CDN that has many points of presence, because the new certificate can propagate to some edges and not others. Now the cert a visitor sees depends on which edge they hit, which usually means it depends on where in the world they are. One market gets the fresh certificate, another market gets the stale one, and a single-location check sees whichever edge happens to answer it. The only way to catch a geo-divergent certificate is to look from more than one place at once, which is the multi-region angle the closing section comes back to.&lt;/p&gt;

&lt;h2&gt;
  
  
  The chain can break while the leaf is valid
&lt;/h2&gt;

&lt;p&gt;The second trap has nothing to do with expiry dates at all. A certificate is not trusted in isolation. A client trusts it by building a path from your leaf certificate, through one or more intermediate certificates, up to a root certificate it already trusts. If any link in that path is missing or expired in the chain your server sends, the client cannot complete the path, and it rejects the connection, even though your leaf says it is valid for months.&lt;/p&gt;

&lt;h3&gt;
  
  
  How a served chain breaks
&lt;/h3&gt;

&lt;p&gt;The most common cause is the simplest: the server is configured to send only the leaf certificate and omits the intermediate bundle. A lot of clients paper over this by fetching the missing intermediate themselves or by caching one they have seen before, which is exactly why it is so dangerous. It works in your browser, it works on your laptop, and it fails on the strict client that does not do that extra fetching. The chain was always incomplete; you just had a forgiving client.&lt;/p&gt;

&lt;p&gt;The other cause is timing inside the chain itself. Certificate authorities sometimes cross-sign an intermediate or root with an older, more widely trusted certificate so that older devices can still build a path. That cross-signing certificate has its own expiry date, separate from your leaf. When the cross-sign expires, clients that were relying on it to reach a trusted root suddenly cannot, even though your leaf and the newer chain are completely valid. The mechanism is the same as a missing intermediate: a path that used to resolve no longer does, and your leaf is innocent the whole time.&lt;/p&gt;
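&lt;p&gt;Both causes can be modeled with a toy path builder that, like a strict client, uses only the certificates the server sent plus its own trust store. This is an illustration, not real X.509 validation: there are no signatures, key identifiers, or name constraints, and the certificate names and dates are made up.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str
    not_after: date


def builds_trusted_path(leaf: Cert, served: list, trust_store: dict,
                        today: date, max_depth: int = 5) -> bool:
    """Toy check: can a path run from leaf to an unexpired trusted root using only
    the served certificates (no fetching of missing intermediates)?
    trust_store maps a root's subject name to that root's own expiry date."""
    by_subject = {}
    for cert in served:
        by_subject.setdefault(cert.subject, []).append(cert)  # cross-signs share a subject

    def walk(cert: Cert, depth: int) -> bool:
        if today > cert.not_after:
            return False  # an expired link breaks this path, whatever the leaf says
        root_expiry = trust_store.get(cert.issuer)
        if root_expiry is not None and root_expiry >= today:
            return True  # issued directly by a root this client trusts
        if depth == 0:
            return False
        # Try every served certificate that could have issued this one.
        return any(walk(parent, depth - 1) for parent in by_subject.get(cert.issuer, []))

    return walk(leaf, max_depth)


today = date(2026, 6, 3)
leaf = Cert("app.example.com", "Example Intermediate", date(2026, 9, 1))  # valid leaf
intermediate = Cert("Example Intermediate", "New Root", date(2030, 1, 1))
cross_sign = Cert("New Root", "Old Root", date(2026, 5, 1))  # already expired

# Leaf-only server: fails even for a client that trusts New Root.
print(builds_trusted_path(leaf, [], {"New Root": date(2040, 1, 1)}, today))  # prints False
# Full chain, but a client that only trusts Old Root needs the expired cross-sign.
print(builds_trusted_path(leaf, [intermediate, cross_sign], {"Old Root": date(2040, 1, 1)}, today))  # prints False
# Same chain, client that trusts New Root directly.
print(builds_trusted_path(leaf, [intermediate, cross_sign], {"New Root": date(2040, 1, 1)}, today))  # prints True
```

&lt;p&gt;In all three calls the leaf itself is valid for months, so a days-remaining badge would read the same number every time. The toy stops at the first root the client already trusts. As we understand it, older OpenSSL 1.0.x did not prefer that shorter path by default, which matters for the 2021 case below.&lt;/p&gt;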

&lt;h3&gt;
  
  
  A dated illustration: the DST Root CA X3 cross-sign expiry, 2021
&lt;/h3&gt;

&lt;p&gt;The clearest real example is history, not a current threat, so read it as a 2021 case study rather than a live warning. On September 30 2021, DST Root CA X3, the old root that had cross-signed Let's Encrypt's own root, ISRG Root X1, reached its expiry date. Servers were sending a chain that built up to that old root for the benefit of older clients. When it expired, some of those clients could no longer build a trusted path and started rejecting connections. Systems on OpenSSL 1.0.x were the notable case: they could fail even when ISRG Root X1 was already in their trust store, because they followed the served path to the expired root instead of the valid one (&lt;a href="https://letsencrypt.org/docs/dst-root-ca-x3-expiration-september-2021/" rel="noopener noreferrer"&gt;Let's Encrypt, DST Root CA X3 expiration, September 2021&lt;/a&gt;, verified 2026-06-03).&lt;/p&gt;

&lt;p&gt;The point of the example is not the specific roots. It is the shape. Every affected leaf certificate had a perfectly valid &lt;code&gt;notAfter&lt;/code&gt; date the entire time. A days-remaining check would have shown green through the whole event. The thing that expired was a link in the chain, not the certificate the badge was watching, and the clients that broke were the ones with no human to click through. For diagnosing a chain on a specific host, the right tool is a chain inspector like &lt;a href="https://www.ssllabs.com/ssltest/" rel="noopener noreferrer"&gt;SSL Labs&lt;/a&gt;, which walks the served chain and grades it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automated clients hard-fail where browsers click through
&lt;/h2&gt;

&lt;p&gt;Both traps above share a property that makes them genuinely hard to notice: the way you check matters more than whether you check. Open the site in your own browser and you are using the most forgiving client that exists. An interactive browser that hits a broken chain or an expired cert shows a full-page warning and, in most cases, offers a way to proceed anyway. A human reads the warning, clicks through, and the page loads. From your chair, the site works.&lt;/p&gt;

&lt;p&gt;Automated clients do not get that choice. A cron job calling your API, a server-side SDK talking to a partner, a webhook delivery from a payment processor, a mobile app refreshing data, and an external monitor all fail the TLS handshake and throw an error. There is no "proceed anyway" for a machine. So the exact integrations you depend on, machine to machine, the ones with nobody sitting there to override a warning, are the ones that break first and loudest, while your own browser tells you everything is fine.&lt;/p&gt;

&lt;p&gt;This is why a real client check beats a human spot-check, and it is the same reasoning behind treating any automated dependency as something you verify on a schedule rather than trust by default. The patterns for proving an automated client can actually complete a call live in our &lt;a href="https://velprove.com/blog/api-health-check-patterns" rel="noopener noreferrer"&gt;API health check patterns guide&lt;/a&gt;. A broken cert is just one more reason a machine-to-machine call fails closed while a browser sails through.&lt;/p&gt;

&lt;h2&gt;
  
  
  What catches this when a days-remaining check can't
&lt;/h2&gt;

&lt;p&gt;Here is the honest scope of what Velprove does, because this argument only works if it is accurate. Velprove's SSL probe reads the leaf certificate: its expiry date and its issuer. It does not inspect or report the certificate chain, so any tool that claims to walk and grade your full chain is doing something Velprove's SSL probe does not. For that, use a dedicated chain inspector like &lt;a href="https://www.ssllabs.com/ssltest/" rel="noopener noreferrer"&gt;SSL Labs&lt;/a&gt;, the same recommendation as our expiry guide.&lt;/p&gt;

&lt;p&gt;What Velprove does catch is the symptom, which is the part your customers actually feel. Run an HTTP monitor with a content assertion against your live URL, and it fails closed whenever the server serves a chain a standard client will not trust, because the fetch itself errors out on the handshake. The content assertion goes further. It checks that a known piece of your page actually came back, so an error page returned with a success status, or a response missing the content you expect, also surfaces as a failed assertion that a bare leaf-expiry reading sails straight past. You are not asking "is the leaf expiry date fine." You are asking "can a real client load this page," and for both traps in this post the answer is no.&lt;/p&gt;
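
&lt;p&gt;The same idea fits in a few lines of Python, independent of any particular tool. This is a minimal sketch, not how Velprove implements its monitors. The marker string you pass in is whatever known text your page should contain. Python's &lt;code&gt;urllib&lt;/code&gt; verifies the certificate chain and hostname by default for HTTPS, so an untrusted chain fails the fetch before the assertion even runs.&lt;/p&gt;

```python
import urllib.request


def content_assertion(body: str, marker: str) -> bool:
    """True only if a known piece of the page actually came back."""
    return marker in body


def check_page(url: str, marker: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Fetch the live URL like an automated client, then assert on its content."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:
        # urllib.error.URLError (an OSError subclass) wraps TLS handshake
        # failures, DNS errors and HTTP error statuses; a read timeout
        # raises TimeoutError, which is also an OSError.
        return False, f"fetch failed: {exc}"
    if not content_assertion(body, marker):
        return False, f"marker {marker!r} not found"
    return True, "ok"
```

&lt;p&gt;Run it on a schedule, for example &lt;code&gt;check_page("https://www.example.com/", "Add to cart")&lt;/code&gt; with a marker of your choosing, and alert on any &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;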

&lt;p&gt;Because every Velprove monitor runs from a region you choose, and 5 regions are available on every plan including free, the same check catches a CDN serving a stale or geo-divergent certificate in one market. Deploy the same monitor in more than one region and the market where the edge is serving the wrong cert goes red while the others stay green. That is the geo-divergence problem from the disk-versus-memory section, caught from outside.&lt;/p&gt;

&lt;p&gt;The renew-without-reload case is caught the same way, and for the same reason. The monitor talks to the live socket, not the file on disk. If the edge or origin is still serving the old certificate from memory, the monitor sees the old certificate, because that is what a real visitor sees. A check that reads the file would be fooled. A check that opens a connection cannot be, because it is asking the same thing your customer is.&lt;/p&gt;
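
&lt;p&gt;You can check the disk-versus-wire gap directly with a short Python sketch. It assumes you can read the certificate file on the server. Certbot's default location is &lt;code&gt;/etc/letsencrypt/live/YOUR_DOMAIN/cert.pem&lt;/code&gt;, where &lt;code&gt;YOUR_DOMAIN&lt;/code&gt; is a placeholder. The sketch fingerprints the first certificate in the file and the leaf the live socket presents. If they differ, a renewal landed but the server was never reloaded.&lt;/p&gt;

```python
import hashlib
import socket
import ssl

BEGIN = "-----BEGIN CERTIFICATE-----"
END = "-----END CERTIFICATE-----"


def first_pem_block(pem_text: str) -> str:
    """Return only the first certificate, so fullchain files work too."""
    start = pem_text.index(BEGIN)
    end = pem_text.index(END, start) + len(END)
    return pem_text[start:end]


def disk_fingerprint(pem_text: str) -> str:
    """SHA-256 of the certificate file on disk (first PEM block)."""
    return hashlib.sha256(ssl.PEM_cert_to_DER_cert(first_pem_block(pem_text))).hexdigest()


def served_fingerprint(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """SHA-256 of the leaf the live socket presents, i.e. what is in memory."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # read the cert even if it has expired
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            return hashlib.sha256(tls.getpeercert(binary_form=True)).hexdigest()


def reload_pending(pem_text: str, served_fp: str) -> bool:
    """True when disk and wire disagree: renewed, but not yet reloaded."""
    return disk_fingerprint(pem_text) != served_fp
```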

&lt;p&gt;One safety note if you wire this up against an authenticated path. Use a dedicated, low-privilege test account, never real admin credentials, so a monitoring credential can do nothing more than confirm a page loads. That same least-privilege rule shows up in the broader &lt;a href="https://velprove.com/blog/monitoring-mistakes-small-business" rel="noopener noreferrer"&gt;monitoring mistakes small businesses make&lt;/a&gt;, which already lists having no SSL alerts at all as one of them. The upgrade this post argues for is going one step past an expiry alert to a functional check that proves a real client can connect.&lt;/p&gt;

&lt;p&gt;Velprove is a solo-founder uptime tool built for exactly this kind of gap. Its standout feature, the free no-code browser login monitor that signs into your own application with a real browser, is overkill for a pure HTTPS health check, but it is there on the free plan if your site sits behind a login. For this job, a single HTTP monitor with a content assertion, run from the regions you care about, is enough to turn a silent cert failure into an alert. Commercial use is allowed on every plan, including free, with no credit card.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  My cert auto-renewed but the site still serves the old one. Why?
&lt;/h3&gt;

&lt;p&gt;Because the new certificate is on disk, but your web server is still holding the old one in memory. Tools like certbot can renew the certificate files successfully and write fresh ones to disk, yet Nginx, Apache, or your load balancer keep serving the previously loaded certificate until the process is reloaded. The fix is a deploy or reload hook that reloads the web server every time a renewal succeeds. Certbot's &lt;code&gt;--deploy-hook&lt;/code&gt; option runs a command only after a certificate has actually been renewed. Until that reload runs, what is on disk and what is on the wire are two different certificates, and a check that reads the file on disk would tell you everything is fine while the live socket serves a soon-to-expire cert.&lt;/p&gt;
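
&lt;p&gt;Here is a minimal sketch of such a hook in Python, assuming Nginx managed by systemd. Swap the commands for Apache or your load balancer. Certbot can run it via &lt;code&gt;certbot renew --deploy-hook /path/to/hook.py&lt;/code&gt; (the path is a placeholder), with a final &lt;code&gt;sys.exit(reload_web_server())&lt;/code&gt; line added when you install it as a script. The command runner is injectable only so the logic can be tested without touching a real server.&lt;/p&gt;

```python
import os
import subprocess
import sys


def reload_web_server(run=subprocess.run) -> int:
    """Hypothetical certbot deploy hook body: validate, then reload Nginx.

    Certbot runs deploy hooks only after a certificate is actually renewed
    and sets RENEWED_LINEAGE (the live/ directory) and RENEWED_DOMAINS.
    """
    lineage = os.environ.get("RENEWED_LINEAGE", "unknown lineage")
    domains = os.environ.get("RENEWED_DOMAINS", "")
    # Check the config first so a broken file never takes the server down.
    check = run(["nginx", "-t"], capture_output=True, text=True)
    if check.returncode != 0:
        print(f"nginx -t failed, not reloading: {check.stderr.strip()}", file=sys.stderr)
        return 1
    # A reload makes Nginx re-read the certificate from disk without dropping
    # connections; until it runs, the old cert stays in memory.
    done = run(["systemctl", "reload", "nginx"], capture_output=True, text=True)
    if done.returncode != 0:
        print(f"reload failed for {lineage}: {done.stderr.strip()}", file=sys.stderr)
        return 1
    print(f"reloaded nginx for {domains or lineage}")
    return 0
```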

&lt;h3&gt;
  
  
  What is a broken certificate chain if the certificate itself is valid?
&lt;/h3&gt;

&lt;p&gt;A served chain is the leaf certificate plus the intermediate certificates a client needs to build a path back to a trusted root. The leaf can be perfectly valid, unexpired, and correctly issued, while the chain a client receives is incomplete or contains an expired cross-sign. That happens most often when a server is configured to send only the leaf and omits the intermediate bundle, or when a cross-signing certificate the chain relied on reaches its own expiry date. A standard client cannot build a trusted path, so it rejects the connection even though the leaf says it is valid for months. A days-until-expiry reading of the leaf never sees this, because the leaf is genuinely fine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Will a "days until expiry" check catch a renewal-automation failure?
&lt;/h3&gt;

&lt;p&gt;Partially, and only one kind. A days-remaining check catches the certificate that is genuinely about to lapse, which is the case the threshold rule is built for. It does not catch a renew-without-reload gap, where the disk has a fresh cert but the server still serves the old one, and it does not catch a broken served chain, where the leaf is valid but a standard client cannot trust it. Both of those fail while the days-remaining number still looks healthy. You need a check that talks to the live socket and confirms a real client can complete the handshake and load the page, not just one that reads the expiry field off the leaf.&lt;/p&gt;
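
&lt;p&gt;For contrast, this is essentially all a days-remaining check computes. Here is a minimal Python sketch using the &lt;code&gt;notAfter&lt;/code&gt; string format that Python's &lt;code&gt;ssl&lt;/code&gt; module reports. Nothing in it looks at the served chain or at whether the server was reloaded.&lt;/p&gt;

```python
import ssl
import time
from typing import Optional


def days_remaining(not_after: str, now: Optional[float] = None) -> int:
    """Whole days (floored) until the leaf's notAfter, given the format
    Python's ssl module reports, e.g. 'Sep 30 14:01:15 2026 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    current = time.time() if now is None else now
    return int((expires - current) // 86400)
```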

&lt;h3&gt;
  
  
  Are SSL certificates really 47 days now?
&lt;/h3&gt;

&lt;p&gt;Not yet. The CA/Browser Forum adopted a schedule that steps the maximum public TLS certificate lifetime down over several years: 200 days starting March 15 2026, 100 days starting March 15 2027, and 47 days starting March 15 2029. So 47 days is the 2029 endpoint, not today's number. The trend that matters now is the direction. Every step shortens the lifetime, which means more renewal events per year, which means more chances for a renewal to succeed on disk but fail on the wire. Shorter lifetimes do not make renewal automation more fragile by themselves; they just multiply how often you run it, and anything you run more often you should watch more carefully.&lt;/p&gt;
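
&lt;p&gt;The arithmetic behind "more renewal events per year" is simple enough to sketch. This Python snippet encodes the schedule above plus the 398-day cap that applied before the first step. It computes the minimum number of renewals a year if you renewed only at the last possible moment. Real automation renews well before expiry, so the true count is higher.&lt;/p&gt;

```python
from datetime import date

# Maximum public TLS certificate lifetime by issuance date, newest step first,
# following the CA/Browser Forum schedule described above.
SCHEDULE = [
    (date(2029, 3, 15), 47),
    (date(2027, 3, 15), 100),
    (date(2026, 3, 15), 200),
]
PRE_SCHEDULE_CAP = 398  # days, the cap before the first step


def max_lifetime_days(issued: date) -> int:
    for start, days in SCHEDULE:
        if issued >= start:
            return days
    return PRE_SCHEDULE_CAP


def renewals_per_year(lifetime_days: int) -> float:
    """Floor on renewals per year if each cert is used to its last day."""
    return 365 / lifetime_days
```

&lt;p&gt;At 47 days that floor is already about 7.8 renewals a year, compared with roughly 0.9 at 398 days.&lt;/p&gt;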

&lt;h3&gt;
  
  
  Do browsers and API clients react to a broken cert the same way?
&lt;/h3&gt;

&lt;p&gt;No, and the difference is why a broken cert can be invisible to you. An interactive browser shows a full-page warning and usually offers a way to proceed anyway, so a human can click through and the page still loads for them. Automated clients do not. A cron job, an SDK, a server-to-server API call, a webhook delivery, and an external monitor all fail the TLS handshake and throw an error with no human to override it. So your own browser can look fine while every machine-to-machine integration you depend on is failing closed. The clients that break first are exactly the ones with nobody watching.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is this different from SSL expiry monitoring?
&lt;/h3&gt;

&lt;p&gt;Expiry monitoring answers one question: how many days until the leaf certificate lapses, and it is the right tool for that. For the expiry-countdown setup and the 30-15-7 threshold rule, see our guide on &lt;a href="https://velprove.com/blog/ssl-certificate-expiry-monitoring" rel="noopener noreferrer"&gt;SSL certificate expiry monitoring&lt;/a&gt;. This post covers the failures that a days-remaining number cannot see: a renewal that wrote a fresh cert to disk but never reloaded the server, and a served chain that breaks while the leaf is still valid. Both show a healthy expiry number and a broken site at the same time. The two are complementary. Run an expiry check for the countdown, and run a functional check against the live socket so a real client failure surfaces even when the days-remaining badge is green.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start watching the symptom, not just the date
&lt;/h2&gt;

&lt;p&gt;Renewal automation is a good thing, and you should keep it. The mistake is believing that a successful renewal and a green days-remaining badge mean a working site. They do not. A cert can renew on disk and never reach memory, a chain can break while the leaf is valid for months, and both failures hide behind a healthy expiry number while every automated client fails closed. On the road to 47-day certificates you will run renewals several times more often than you do today, so the odds of one of these slipping through go up, not down.&lt;/p&gt;

&lt;p&gt;Catch the symptom from outside. Point an HTTP monitor with a content assertion at your live HTTPS URL, run it from the regions your customers are in, and you get an alert the moment a real client cannot load the page, whatever the underlying cert problem is. Pair it with an expiry countdown for the leaf. For the countdown setup and the 30-15-7 threshold rule, see &lt;a href="https://velprove.com/blog/ssl-certificate-expiry-monitoring" rel="noopener noreferrer"&gt;SSL certificate expiry monitoring&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Set up a free HTTPS monitor with Velprove&lt;/a&gt;. The region you choose, a content assertion, no credit card, commercial use allowed on every plan including free.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>webdev</category>
      <category>devops</category>
      <category>uptime</category>
    </item>
    <item>
      <title>Payment Gateway Monitoring: What Breaks When It's Not Stripe</title>
      <dc:creator>velprove</dc:creator>
      <pubDate>Wed, 03 Jun 2026 14:00:03 +0000</pubDate>
      <link>https://dev.to/velprove/payment-gateway-monitoring-what-breaks-when-its-not-stripe-1ipo</link>
      <guid>https://dev.to/velprove/payment-gateway-monitoring-what-breaks-when-its-not-stripe-1ipo</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Payment gateway monitoring breaks the moment you copy a Stripe monitor onto something that is not Stripe, and that is the fastest way to get a false DOWN alert on PayPal or Square. Stripe uses a static secret key that never expires. PayPal and Square use OAuth bearer tokens that do, so a cloned monitor fires when its token lapses, not when the gateway is down. Adyen avoids that trap with API-key auth but hides its webhook signature in the JSON body instead of a header. This post is the per-vendor delta for PayPal, Square, and Adyen, the table of what actually differs, and how to build a token-refreshing monitor for free with no code. &lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Start for free.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Square went dark for 15 hours and never called you
&lt;/h2&gt;

&lt;p&gt;On September 7 2023, Square went down. The outage began at 1:54 PM ET and was not resolved until 5:19 AM ET the next morning, roughly 15 hours later. Square's own incident summary traces the root cause to a DNS failure: an unrelated change to host-based firewalls combined with a DNS service upgrade overloaded the internal DNS servers and knocked them over (&lt;a href="https://developer.squareup.com/blog/incident-summary-2023-09-07/" rel="noopener noreferrer"&gt;Square's incident summary, 2023-09-07&lt;/a&gt;, verified 2026-06-02). For most of that window, sellers on Square Online could not process payments at all.&lt;/p&gt;

&lt;p&gt;Here is the part that matters for you. Square has a status page. Square published a postmortem. Square told the world. What Square did not do, what no payment gateway does, is call you when your specific integration with it stops working. The status page tracks Square's platform. It does not track whether your token expired this morning, whether your webhook endpoint started returning 500 after last night's deploy, or whether the checkout SDK quietly fails to load for your customers in one region. That gap is yours to close, and it is the same on every gateway that is not Stripe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Payment gateway monitoring with the Stripe playbook does not port
&lt;/h2&gt;

&lt;p&gt;Most monitoring advice on the internet assumes Stripe. If you are on Stripe, that advice is good and you should follow it. See how to &lt;a href="https://velprove.com/blog/monitor-stripe-api-health" rel="noopener noreferrer"&gt;monitor its API health&lt;/a&gt; for the outbound side and our guide to &lt;a href="https://velprove.com/blog/monitor-stripe-webhooks" rel="noopener noreferrer"&gt;monitoring Stripe webhooks&lt;/a&gt; for the inbound side. The problem starts when you copy that playbook onto PayPal, Square, or Adyen, because three structural things are different and each one breaks a copied monitor in a different way.&lt;/p&gt;

&lt;p&gt;First, the &lt;strong&gt;auth model&lt;/strong&gt;. Stripe authenticates with a static secret key that never expires. PayPal and Square authenticate with OAuth bearer tokens that do expire. Adyen uses an API key or basic auth. A monitor that pastes one fixed credential into a header works forever on Stripe and breaks on a schedule on PayPal and Square.&lt;/p&gt;

&lt;p&gt;Second, the &lt;strong&gt;webhook signature mechanics&lt;/strong&gt;. Stripe signs events one way, PayPal a second way with a set of headers and a verify endpoint, Square a third way with a single header, and Adyen a fourth way that is not a header at all. A handler or monitor written to read a signature header silently misses Adyen.&lt;/p&gt;

&lt;p&gt;Third, the &lt;strong&gt;status-page machine-readability&lt;/strong&gt;. Stripe and Square expose status JSON a monitor can fetch and assert on. PayPal exposes a custom endpoint with field names you should not assume. Adyen is page-only. So the easy assertion you wrote for Stripe is only portable to one of the three. The rest of this post is that table, then the fixes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The per-vendor delta table
&lt;/h2&gt;

&lt;p&gt;This is the centerpiece. Stripe is the baseline column. Read across any row to see exactly where a copied monitor will break for the gateway you are actually on.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Stripe (baseline)&lt;/th&gt;
&lt;th&gt;PayPal&lt;/th&gt;
&lt;th&gt;Square&lt;/th&gt;
&lt;th&gt;Adyen&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Auth model&lt;/td&gt;
&lt;td&gt;Static secret key (no expiry)&lt;/td&gt;
&lt;td&gt;OAuth bearer (expires), &lt;code&gt;POST /v1/oauth2/token&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;OAuth bearer (expires), &lt;code&gt;/oauth2/token&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;API key / basic auth (no public OAuth token endpoint)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token-trap risk&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;HIGH&lt;/td&gt;
&lt;td&gt;LOW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Webhook signature&lt;/td&gt;
&lt;td&gt;See &lt;a href="https://velprove.com/blog/monitor-stripe-webhooks" rel="noopener noreferrer"&gt;how Stripe signs webhook events&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Headers &lt;code&gt;PAYPAL-TRANSMISSION-SIG&lt;/code&gt; / &lt;code&gt;-ID&lt;/code&gt; / &lt;code&gt;-TIME&lt;/code&gt; / &lt;code&gt;-CERT-URL&lt;/code&gt; / &lt;code&gt;-AUTH-ALGO&lt;/code&gt; (e.g. SHA256withRSA) plus verify endpoint&lt;/td&gt;
&lt;td&gt;Header &lt;code&gt;x-square-hmacsha256-signature&lt;/code&gt; (HMAC-SHA-256)&lt;/td&gt;
&lt;td&gt;JSON field &lt;code&gt;additionalData.hmacSignature&lt;/code&gt; (HMAC-SHA256 base64). Not a header.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Webhook retries&lt;/td&gt;
&lt;td&gt;See &lt;a href="https://velprove.com/blog/monitor-stripe-webhooks" rel="noopener noreferrer"&gt;how Stripe retries failed deliveries&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;~25 times over ~3 days&lt;/td&gt;
&lt;td&gt;Retries with backoff&lt;/td&gt;
&lt;td&gt;Retries with backoff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Status JSON a monitor can assert&lt;/td&gt;
&lt;td&gt;See the &lt;a href="https://velprove.com/blog/monitor-stripe-api-health" rel="noopener noreferrer"&gt;Stripe API health guide&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Custom &lt;code&gt;/api/production&lt;/code&gt; (fields PayPal-specific, do not assume)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/api/v2/status.json&lt;/code&gt; (Atlassian, assertable)&lt;/td&gt;
&lt;td&gt;Page only, no confirmed JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sandbox / prod hosts&lt;/td&gt;
&lt;td&gt;n/a for this post&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;api-m.sandbox.paypal.com&lt;/code&gt; / &lt;code&gt;api-m.paypal.com&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;connect.squareupsandbox.com&lt;/code&gt; / &lt;code&gt;connect.squareup.com&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;checkout-test.adyen.com&lt;/code&gt; / &lt;code&gt;{PREFIX}-checkout-live.adyenpayments.com&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The expiring-token trap: why a copied Stripe monitor goes false-DOWN
&lt;/h2&gt;

&lt;p&gt;This is the single most common false alert on non-Stripe gateways, so it gets its own section. A Stripe secret key never expires. You paste it into an Authorization header once and the monitor authenticates forever. So the natural thing to do when you move to PayPal or Square is the same thing: grab a token, paste it into the header, save the monitor.&lt;/p&gt;

&lt;p&gt;It works. For a while. PayPal and Square access tokens are OAuth bearer tokens with a finite lifetime. The token you pasted in is already counting down. When it lapses, the gateway starts answering your monitor with a &lt;code&gt;401 Unauthorized&lt;/code&gt;. Your monitor reads a non-2xx response and flips to DOWN. PayPal is up. Square is up. Every other customer is checking out fine. The only thing that is down is your stale token, and your monitor has no way to tell the difference between "the gateway is unreachable" and "my credential expired."&lt;/p&gt;
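
&lt;p&gt;If you cannot fix the monitor right away, at least stop it from calling an expired token an outage. Here is a minimal Python sketch of that triage, where &lt;code&gt;None&lt;/code&gt; stands for a run that got no HTTP response at all. The verdict names are our own, and treating 403 like 401 is an assumption you may want to drop for your gateway.&lt;/p&gt;

```python
from typing import Optional


def classify(status: Optional[int]) -> str:
    """Map one monitor run to a verdict instead of a bare UP/DOWN."""
    if status is None:
        return "GATEWAY_UNREACHABLE"  # DNS failure, timeout, TLS error
    if status in (401, 403):
        return "CREDENTIAL_PROBLEM"  # expired or revoked token, not an outage
    if status >= 500:
        return "GATEWAY_ERROR"
    if status // 100 == 2:
        return "UP"
    return "UNEXPECTED"
```

&lt;p&gt;Page on &lt;code&gt;GATEWAY_UNREACHABLE&lt;/code&gt; and &lt;code&gt;GATEWAY_ERROR&lt;/code&gt;, and route &lt;code&gt;CREDENTIAL_PROBLEM&lt;/code&gt; to a ticket. This is triage, not a cure. The cure is to stop reusing tokens at all.&lt;/p&gt;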

&lt;p&gt;Now you have the worst kind of monitor: one that cries wolf on a schedule. After the third or fourth 3 AM page that turned out to be nothing, the team mutes it. Then the gateway actually goes down and nobody is watching. A monitor that fires on its own expiring credential is worse than no monitor, because it trains you to ignore it.&lt;/p&gt;

&lt;p&gt;Adyen is the contrast that proves the point. Adyen authenticates with an API key or basic auth and has no public OAuth token endpoint, so there is no short-lived bearer to go stale between runs. The expiring-token trap does not apply to Adyen the same way. You still rotate API keys deliberately, but a correctly configured Adyen monitor does not false-DOWN on a timer. The trap is specific to the OAuth gateways, which is exactly why the fix below is too.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: refresh the token as step 1 of a multi-step monitor
&lt;/h2&gt;

&lt;p&gt;The fix is to stop pasting a token and start minting one. Make the token refresh the first step of a multi-step monitor, then use the fresh token in the second step. The monitor mints a new bearer on every single run, so it never carries a stale one, and it can never false-DOWN because a credential lapsed between runs.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;PayPal&lt;/strong&gt;, step 1 is a &lt;code&gt;POST /v1/oauth2/token&lt;/code&gt; with your client credentials, which returns a fresh bearer. Step 2 calls the real endpoint you care about and sends that bearer in the Authorization header. For &lt;strong&gt;Square&lt;/strong&gt;, step 1 is a POST to &lt;code&gt;/oauth2/token&lt;/code&gt; that exchanges your stored refresh token for a fresh bearer, and step 2 uses it on the protected endpoint. In both cases, step 1 captures the bearer from the token response, and step 2 GETs the protected route with it and asserts the status code is 200.&lt;/p&gt;
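
&lt;p&gt;Here is the PayPal version of that two-step flow as a minimal Python sketch, using only the standard library. Step 1 is PayPal's client-credentials grant: HTTP Basic auth with your client ID and secret, and a form-encoded &lt;code&gt;grant_type=client_credentials&lt;/code&gt; body. Step 2 calls whatever read-only route you pass as &lt;code&gt;protected_path&lt;/code&gt;, which is your choice, not something this post prescribes. Square's token call has a different shape (a JSON body posted to &lt;code&gt;connect.squareup.com/oauth2/token&lt;/code&gt;), so only the pattern carries over.&lt;/p&gt;

```python
import base64
import json
import urllib.parse
import urllib.request

PAYPAL_SANDBOX = "https://api-m.sandbox.paypal.com"


def token_request(base: str, client_id: str, secret: str) -> urllib.request.Request:
    """Step 1: client-credentials grant against PayPal's OAuth token endpoint."""
    basic = base64.b64encode(f"{client_id}:{secret}".encode()).decode()
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    return urllib.request.Request(
        f"{base}/v1/oauth2/token",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


def run_monitor(base: str, client_id: str, secret: str, protected_path: str, timeout: float = 10.0) -> int:
    """Mint a fresh bearer (step 1), then call the protected route with it
    (step 2). Nothing is cached between runs, so no token can go stale."""
    with urllib.request.urlopen(token_request(base, client_id, secret), timeout=timeout) as resp:
        access_token = json.load(resp)["access_token"]
    call = urllib.request.Request(
        f"{base}{protected_path}",
        headers={"Authorization": f"Bearer {access_token}"},
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx, so reaching the
    # return means the protected route answered with a 2xx status.
    with urllib.request.urlopen(call, timeout=timeout) as resp:
        return resp.status
```

&lt;p&gt;Because &lt;code&gt;urlopen&lt;/code&gt; raises on a 4xx or 5xx, a 401 from a bad credential surfaces as an exception rather than a status code. Catch it and classify it separately from a gateway that did not answer. Read the client ID and secret from the environment, and point &lt;code&gt;base&lt;/code&gt; at &lt;code&gt;https://api-m.paypal.com&lt;/code&gt; for production.&lt;/p&gt;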

&lt;p&gt;Why does the &lt;em&gt;ordering&lt;/em&gt; matter so much here, when it is irrelevant on Stripe? Because on Stripe step 1 would be busywork. The secret key is already valid, so a refresh step adds nothing. On an OAuth gateway the refresh step is the entire reason the monitor stays honest. It converts "is my pasted token still alive" into "can I get a token and use it right now," which is the question you actually want answered. The token lifetime stops mattering because you never reuse a token across runs.&lt;/p&gt;

&lt;p&gt;This post owns only &lt;em&gt;why&lt;/em&gt; that ordering matters for OAuth gateways. For the mechanics of building a multi-step monitor, capturing a value from step 1, and passing it into step 2, follow the &lt;a href="https://velprove.com/blog/multi-step-api-monitoring-guide" rel="noopener noreferrer"&gt;multi-step API monitoring guide&lt;/a&gt;. The pattern is the same one used there; only the token endpoint and the protected route change per vendor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring the webhook leg
&lt;/h2&gt;

&lt;p&gt;The inbound webhook leg fails the same way on every gateway: signature verification breaks, the endpoint times out, or deliveries get dropped at your edge. The generic pattern for catching that, a chained monitor that proves an event arrived and your handler moved real state, is the same idea regardless of vendor. We cover it in full in the &lt;a href="https://velprove.com/blog/monitor-stripe-webhooks" rel="noopener noreferrer"&gt;end-to-end webhook delivery monitor walkthrough&lt;/a&gt;, and the structure ports cleanly. What does not port is the signature format, so that is all this section owns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PayPal.&lt;/strong&gt; PayPal does not use a single signature header. It sends a set: &lt;code&gt;PAYPAL-TRANSMISSION-SIG&lt;/code&gt;, &lt;code&gt;PAYPAL-TRANSMISSION-ID&lt;/code&gt;, &lt;code&gt;PAYPAL-TRANSMISSION-TIME&lt;/code&gt;, &lt;code&gt;PAYPAL-CERT-URL&lt;/code&gt;, and &lt;code&gt;PAYPAL-AUTH-ALGO&lt;/code&gt; (for example SHA256withRSA). It also offers a verify-webhook-signature endpoint you can call to confirm a payload server-side. PayPal retries a failed delivery roughly 25 times over about 3 days. A handler that is down for an hour still gets the event later, but a handler that silently rejects every delivery burns the whole window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Square.&lt;/strong&gt; Square uses one header, &lt;code&gt;x-square-hmacsha256-signature&lt;/code&gt;, carrying an HMAC-SHA-256 value. That is closer to the Stripe shape, a single header, but the header name and the signing input are different, so a copied Stripe verifier will not validate it as-is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adyen.&lt;/strong&gt; Adyen is the one that breaks naive code. The HMAC is not in a header at all. It lives in the JSON body at &lt;code&gt;notificationItems[].NotificationRequestItem.additionalData.hmacSignature&lt;/code&gt;, a base64 HMAC-SHA256 value. A handler or monitor that scans request headers for a signature finds nothing and either fails open or rejects everything. You have to read the body.&lt;/p&gt;

&lt;h2&gt;
  
  
  Status pages: which one a monitor can actually read
&lt;/h2&gt;

&lt;p&gt;If a gateway exposes machine-readable status, you can add a cheap second signal: assert directly on its status feed. Only one of the three makes that easy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Square&lt;/strong&gt; runs an Atlassian Statuspage and exposes a clean JSON summary at &lt;code&gt;issquareup.com/api/v2/status.json&lt;/code&gt;. A monitor can fetch it and assert on the documented Atlassian schema, which is the same shape across every Statuspage-hosted service. This is the one status feed in the group you can confidently assert on.&lt;/p&gt;
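
&lt;p&gt;Here is a minimal Python sketch of that assertion. Atlassian Statuspage summaries carry a &lt;code&gt;status.indicator&lt;/code&gt; field whose documented values are &lt;code&gt;none&lt;/code&gt;, &lt;code&gt;minor&lt;/code&gt;, &lt;code&gt;major&lt;/code&gt;, and &lt;code&gt;critical&lt;/code&gt;. Treating anything other than &lt;code&gt;none&lt;/code&gt; as unhealthy is our own, deliberately strict, choice.&lt;/p&gt;

```python
import json
import urllib.request

SQUARE_STATUS = "https://issquareup.com/api/v2/status.json"


def platform_indicator(payload: dict) -> str:
    """Read the overall indicator from an Atlassian Statuspage summary."""
    return payload["status"]["indicator"]


def square_platform_ok(timeout: float = 10.0) -> bool:
    """True only when Square's status page reports no incident at all."""
    with urllib.request.urlopen(SQUARE_STATUS, timeout=timeout) as resp:
        return platform_indicator(json.load(resp)) == "none"
```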

&lt;p&gt;&lt;strong&gt;PayPal&lt;/strong&gt; publishes a status page at &lt;code&gt;paypal-status.com&lt;/code&gt; with a custom endpoint at &lt;code&gt;/api/production&lt;/code&gt;. It exists, but the field names are PayPal-specific and not a documented contract, so do not write assertions against specific fields you have not verified will stay stable. &lt;strong&gt;Adyen&lt;/strong&gt; publishes a status page at &lt;code&gt;status.adyen.com&lt;/code&gt; but has no confirmed machine-readable endpoint, so treat it as page-only.&lt;/p&gt;

&lt;p&gt;The practical rule: assert on Square's status JSON if you want a cheap platform-health signal, but for PayPal and Adyen do not lean on the status feed. Monitor your own integration path instead, the token refresh and the protected endpoint, because that is the signal that actually tells you whether your customers can pay. Deciding which third-party dependencies earn a monitor at all is its own triage, walked through in &lt;a href="https://velprove.com/blog/monitor-third-party-dependency-you-dont-own" rel="noopener noreferrer"&gt;monitoring a third-party dependency you do not own&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sandbox vs production: the base-host swap you will forget
&lt;/h2&gt;

&lt;p&gt;Short gotcha, expensive when missed. All three gateways use a different base host for sandbox and production. Point a monitor at the sandbox host by accident and it reports green while production is on fire, because the sandbox really is up. Confirm the host before you save.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PayPal:&lt;/strong&gt; &lt;code&gt;api-m.sandbox.paypal.com&lt;/code&gt; (sandbox) versus &lt;code&gt;api-m.paypal.com&lt;/code&gt; (production). &lt;strong&gt;Square:&lt;/strong&gt; &lt;code&gt;connect.squareupsandbox.com&lt;/code&gt; (sandbox) versus &lt;code&gt;connect.squareup.com&lt;/code&gt; (production). &lt;strong&gt;Adyen:&lt;/strong&gt; &lt;code&gt;checkout-test.adyen.com&lt;/code&gt; (test) versus &lt;code&gt;{PREFIX}-checkout-live.adyenpayments.com&lt;/code&gt; (live, where &lt;code&gt;{PREFIX}&lt;/code&gt; is your account-specific live prefix).&lt;/p&gt;
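
&lt;p&gt;One cheap guard is to refuse any monitor target on a known sandbox host before you save it. Here is a minimal Python sketch using the three sandbox and test hosts listed above. The function name is hypothetical, and Adyen's live host is not listed because its prefix is account-specific.&lt;/p&gt;

```python
from urllib.parse import urlparse

# Sandbox/test hosts named above; a monitor pointed at any of them can stay
# green while production is down.
NON_PRODUCTION_HOSTS = {
    "api-m.sandbox.paypal.com",
    "connect.squareupsandbox.com",
    "checkout-test.adyen.com",
}


def assert_production(url: str) -> str:
    """Return the target host, or raise if it is a sandbox/test host."""
    host = (urlparse(url).hostname or "").lower()
    if host in NON_PRODUCTION_HOSTS:
        raise ValueError(f"{host} is a sandbox/test host, not production")
    return host
```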

&lt;p&gt;The same swap shows up on the client side if you probe whether the checkout SDK loads. PayPal serves its JS SDK from &lt;code&gt;https://www.paypal.com/sdk/js?client-id=...&lt;/code&gt;, which you probe with a GET using a valid client-id, not a bare URL. Square serves &lt;code&gt;https://web.squarecdn.com/v1/square.js&lt;/code&gt; in production and &lt;code&gt;https://sandbox.web.squarecdn.com/v1/square.js&lt;/code&gt; in sandbox. Adyen serves &lt;code&gt;https://checkoutshopper-live.adyen.com/checkoutshopper/sdk/{VERSION}/adyen.js&lt;/code&gt;, where &lt;code&gt;{VERSION}&lt;/code&gt; is the SDK version you have pinned.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to do this in Velprove, free and no-code
&lt;/h2&gt;

&lt;p&gt;Velprove is built for exactly this gap. The headline differentiator is the &lt;strong&gt;free, no-code browser login monitor&lt;/strong&gt;: it opens a real browser, signs in to your own application with a dedicated test account, and confirms the signed-in path renders. If your checkout sits behind a customer login, that monitor proves a real user can actually reach it, with no code to write.&lt;/p&gt;

&lt;p&gt;For the gateway calls themselves, the free plan includes &lt;strong&gt;multi-step API monitors of up to 3 steps&lt;/strong&gt;, which is all the token-refresh pattern needs. Step 1 posts the OAuth token endpoint and captures the bearer, step 2 calls the protected endpoint with it and asserts the status code is 200. You can add a third step to assert on a read-back if you want to prove a real object exists. You pick &lt;strong&gt;which of 5 global regions each monitor runs from&lt;/strong&gt;, on every plan. To catch a checkout SDK that fails to load in just one geography, deploy the same monitor in each region you care about. Each copy uses one of your monitor slots, and the region where the SDK breaks goes red while the others stay green. Commercial use is allowed on every plan including free, with no credit card.&lt;/p&gt;

&lt;p&gt;One safety rule, the same on every gateway: use low-privilege test credentials against read paths, never a real admin key, and never run monitors that move real money. A monitoring credential that leaks should be able to read that a token can be minted and an endpoint responds, nothing more.&lt;/p&gt;

&lt;p&gt;If your gateway sits inside a store rather than a raw API integration, the monitor you want is on the checkout flow, not the gateway endpoint. For WooCommerce see &lt;a href="https://velprove.com/blog/monitor-woocommerce-checkout" rel="noopener noreferrer"&gt;monitoring the WooCommerce checkout&lt;/a&gt;, and for Shopify see &lt;a href="https://velprove.com/blog/monitor-shopify-checkout-flow" rel="noopener noreferrer"&gt;monitoring the Shopify checkout flow&lt;/a&gt;. Both drive the actual buy path a customer takes, which is the truest signal that payments work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://velprove.com/signup" rel="noopener noreferrer"&gt;Set up a free, no-code payment gateway monitor with Velprove&lt;/a&gt; . Two steps, the region you choose, no credit card, commercial use allowed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why does my PayPal monitor report DOWN when PayPal is actually up?
&lt;/h3&gt;

&lt;p&gt;Almost always because the monitor is holding a stale OAuth bearer token. PayPal access tokens expire; a Stripe secret key does not expire on its own. If you copied a Stripe-style monitor that pastes one static token into a header, that token lapses once its lifetime runs out and PayPal starts returning &lt;code&gt;401 Unauthorized&lt;/code&gt;. Your monitor reads the 401 as DOWN even though PayPal is serving every other customer fine. The fix is to mint a fresh token as step 1 of a multi-step monitor by posting to the OAuth token endpoint, then use that token in step 2. Because the token is minted anew on every run, the monitor never carries a stale one.&lt;/p&gt;
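
&lt;p&gt;Here is roughly what those two steps do under the hood, as a minimal Python sketch using only the standard library. The token call follows PayPal's client-credentials flow: &lt;code&gt;POST /v1/oauth2/token&lt;/code&gt; with HTTP Basic auth and the body &lt;code&gt;grant_type=client_credentials&lt;/code&gt;. The function names are hypothetical. So is the read path you pass to step 2, which should be a read-only endpoint your app is actually authorized for.&lt;/p&gt;

```python
import base64
import json
import urllib.parse
import urllib.request

# Production host; the sandbox equivalent is api-m.sandbox.paypal.com.
PAYPAL_BASE = "https://api-m.paypal.com"


def token_request(client_id: str, secret: str) -> urllib.request.Request:
    """Step 1: client-credentials POST to PayPal's OAuth token endpoint."""
    basic = base64.b64encode(f"{client_id}:{secret}".encode()).decode()
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    return urllib.request.Request(
        f"{PAYPAL_BASE}/v1/oauth2/token",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


def protected_request(access_token: str, path: str) -> urllib.request.Request:
    """Step 2: call a read-only endpoint (hypothetical path) with the fresh bearer."""
    return urllib.request.Request(
        f"{PAYPAL_BASE}{path}",
        headers={"Authorization": f"Bearer {access_token}"},
    )


def run_check(client_id: str, secret: str, path: str) -> bool:
    """One monitor run: mint a fresh token, then expect HTTP 200 from the read path."""
    try:
        with urllib.request.urlopen(token_request(client_id, secret), timeout=10) as resp:
            token = json.load(resp)["access_token"]
        with urllib.request.urlopen(protected_request(token, path), timeout=10) as resp:
            return resp.status == 200
    except (OSError, KeyError, ValueError):
        # HTTPError (e.g. a 401) and network errors subclass OSError. With a token
        # minted seconds ago, a 401 points at credentials, not token expiry.
        return False
```

&lt;p&gt;Building the requests separately from sending them keeps each step easy to inspect. The only state carried between runs is the client ID and secret, never a token.&lt;/p&gt;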

&lt;h3&gt;
  
  
  Do Square and PayPal access tokens expire, and how often do I need to refresh them?
&lt;/h3&gt;

&lt;p&gt;Yes, both PayPal and Square use OAuth bearer tokens that expire. The exact lifetime depends on your app and the grant type, and you should read the value the token endpoint returns rather than hardcode a number. The safe pattern for monitoring is to stop tracking the lifetime entirely: mint a fresh token at the start of every monitor run. Step 1 of a multi-step monitor posts to the token endpoint and captures the new bearer, step 2 uses it. Adyen is different. It authenticates with an API key or basic auth and has no public OAuth token endpoint, so the expiring-token trap does not apply the same way.&lt;/p&gt;
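
&lt;p&gt;If you do want to log how long a freshly minted token stays valid, read the field the response actually carries. PayPal's token response includes &lt;code&gt;expires_in&lt;/code&gt;, a lifetime in seconds. Square's ObtainToken response includes &lt;code&gt;expires_at&lt;/code&gt;, an absolute ISO 8601 timestamp. A minimal Python sketch that handles both shapes follows. The function name is hypothetical, and the monitor itself should still mint a fresh token every run rather than act on this number.&lt;/p&gt;

```python
from datetime import datetime, timezone


def seconds_left(token_response: dict, now: datetime) -> float:
    """Return seconds until expiry, read from the token response itself (hypothetical helper)."""
    if "expires_in" in token_response:
        # PayPal style: lifetime in seconds, counted from when the token was issued.
        return float(token_response["expires_in"])
    if "expires_at" in token_response:
        # Square style: absolute timestamp such as "2026-07-01T12:00:00Z".
        # Swap a trailing "Z" for "+00:00" so older Pythons can parse it.
        raw = token_response["expires_at"].replace("Z", "+00:00")
        return (datetime.fromisoformat(raw) - now).total_seconds()
    raise ValueError("token response carries no expiry field")


now = datetime(2026, 7, 1, 11, 0, tzinfo=timezone.utc)
print(seconds_left({"expires_at": "2026-07-01T12:00:00Z"}, now))
```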

&lt;h3&gt;
  
  
  How is Adyen's webhook signature different from PayPal's or Square's?
&lt;/h3&gt;

&lt;p&gt;For its standard payment webhooks, Adyen puts the HMAC signature inside the JSON body, not in an HTTP header. You read it from &lt;code&gt;notificationItems[].NotificationRequestItem.additionalData.hmacSignature&lt;/code&gt;, a base64-encoded HMAC-SHA256 value. Square puts its signature in the &lt;code&gt;x-square-hmacsha256-signature&lt;/code&gt; header. PayPal spreads its signature across a set of headers (&lt;code&gt;PAYPAL-TRANSMISSION-SIG&lt;/code&gt;, &lt;code&gt;PAYPAL-TRANSMISSION-ID&lt;/code&gt;, &lt;code&gt;PAYPAL-TRANSMISSION-TIME&lt;/code&gt;, &lt;code&gt;PAYPAL-CERT-URL&lt;/code&gt;, and &lt;code&gt;PAYPAL-AUTH-ALGO&lt;/code&gt;) and offers a verify-webhook-signature endpoint you can call to confirm a payload. So a monitor or handler written to read a signature header will silently miss Adyen's standard webhooks entirely, because there is no signature header to read.&lt;/p&gt;
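
&lt;p&gt;To make the location difference concrete, here is a minimal Python sketch of both checks using only the standard library. It makes two assumptions, one per gateway.&lt;/p&gt;

&lt;p&gt;For Square, it assumes the signature is an HMAC-SHA256, keyed with your webhook signature key, over the notification URL followed by the raw request body, then base64-encoded.&lt;/p&gt;

&lt;p&gt;For Adyen's standard webhooks, it assumes the hex-encoded HMAC key signs these fields, joined with colons: &lt;code&gt;pspReference&lt;/code&gt;, &lt;code&gt;originalReference&lt;/code&gt;, &lt;code&gt;merchantAccountCode&lt;/code&gt;, &lt;code&gt;merchantReference&lt;/code&gt;, amount value, amount currency, &lt;code&gt;eventCode&lt;/code&gt;, and &lt;code&gt;success&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The function names are hypothetical, so check both schemes against each gateway's current documentation before relying on them. Note that the Adyen check never touches a header.&lt;/p&gt;

```python
import base64
import binascii
import hashlib
import hmac


def square_signature_valid(notification_url: str, raw_body: bytes,
                           header_sig: str, signature_key: str) -> bool:
    """Square: signature arrives in the x-square-hmacsha256-signature header."""
    payload = notification_url.encode() + raw_body
    digest = hmac.new(signature_key.encode(), payload, hashlib.sha256).digest()
    expected = base64.b64encode(digest)
    return hmac.compare_digest(expected, header_sig.encode())


def adyen_signing_string(item: dict) -> str:
    """Adyen: colon-joined fields from one NotificationRequestItem (assumed order)."""
    amount = item.get("amount", {})
    fields = [
        item.get("pspReference", ""),
        item.get("originalReference", ""),  # empty for most event types
        item.get("merchantAccountCode", ""),
        item.get("merchantReference", ""),
        str(amount.get("value", "")),
        amount.get("currency", ""),
        item.get("eventCode", ""),
        item.get("success", ""),  # sent as the string "true" or "false"
    ]
    return ":".join(fields)


def adyen_signature_valid(item: dict, hex_hmac_key: str) -> bool:
    """Adyen: signature arrives in the body at additionalData.hmacSignature."""
    key = binascii.unhexlify(hex_hmac_key)
    digest = hmac.new(key, adyen_signing_string(item).encode(), hashlib.sha256).digest()
    expected = base64.b64encode(digest)
    received = item.get("additionalData", {}).get("hmacSignature", "")
    return hmac.compare_digest(expected, received.encode())
```

&lt;p&gt;Both functions use &lt;code&gt;hmac.compare_digest&lt;/code&gt; for a constant-time comparison.&lt;/p&gt;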

&lt;h3&gt;
  
  
  Which payment gateway has a status page my monitor can actually read?
&lt;/h3&gt;

&lt;p&gt;Square is the clean one. It runs an Atlassian Statuspage and exposes a machine-readable summary at &lt;code&gt;issquareup.com/api/v2/status.json&lt;/code&gt; that a monitor can fetch and assert on directly. PayPal publishes a status page at &lt;code&gt;paypal-status.com&lt;/code&gt; with a custom endpoint at &lt;code&gt;/api/production&lt;/code&gt;, but the field names are PayPal-specific and not a documented contract, so do not assume them. Adyen publishes a status page at &lt;code&gt;status.adyen.com&lt;/code&gt; but has no confirmed machine-readable JSON endpoint, so treat it as page-only. For PayPal and Adyen the more reliable signal is to monitor your own integration path rather than parse their status feed.&lt;/p&gt;
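
&lt;p&gt;Because Square's feed is a standard Atlassian Statuspage v2 endpoint, &lt;code&gt;status.json&lt;/code&gt; returns a &lt;code&gt;status&lt;/code&gt; object. Its &lt;code&gt;indicator&lt;/code&gt; field is one of &lt;code&gt;none&lt;/code&gt;, &lt;code&gt;minor&lt;/code&gt;, &lt;code&gt;major&lt;/code&gt;, or &lt;code&gt;critical&lt;/code&gt;. A minimal Python sketch of asserting on it follows. The mapping of indicators to UP, DEGRADED, and DOWN is an illustrative choice, not something Square defines. Any unexpected value is reported as UNKNOWN rather than guessed.&lt;/p&gt;

```python
import json
import urllib.request

# Statuspage v2 summary endpoint for Square, as named in the text.
SQUARE_STATUS_URL = "https://issquareup.com/api/v2/status.json"


def interpret(status_doc: dict) -> str:
    """Map Statuspage's status.indicator to a monitor verdict (illustrative mapping)."""
    indicator = status_doc.get("status", {}).get("indicator", "unknown")
    if indicator == "none":
        return "UP"
    if indicator == "minor":
        return "DEGRADED"
    if indicator in ("major", "critical"):
        return "DOWN"
    return "UNKNOWN"


def fetch_square_status() -> str:
    """Fetch the live feed and interpret it (requires network access)."""
    with urllib.request.urlopen(SQUARE_STATUS_URL, timeout=10) as resp:
        return interpret(json.load(resp))
```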

&lt;h3&gt;
  
  
  Can I monitor PayPal, Square, or Adyen for free without writing code?
&lt;/h3&gt;

&lt;p&gt;Yes. Velprove's free plan includes no-code browser login monitors and multi-step API monitors of up to 3 steps, and you pick which of 5 global regions each monitor runs from. For a gateway with an expiring token you build a 2-step API monitor in the wizard: step 1 posts to the OAuth token endpoint and captures the bearer, step 2 calls the protected endpoint with it and asserts the status code. No code, no credit card, and commercial use is allowed on every plan including free.&lt;/p&gt;

&lt;h3&gt;
  
  
  Will any of these gateways alert me when my integration breaks?
&lt;/h3&gt;

&lt;p&gt;No. PayPal, Square, and Adyen publish status pages for their own platform health, but none of them watches your specific integration and none of them emails you when your token expired, your webhook endpoint started returning 500, or your checkout SDK stopped loading in one region. Their status page stays green because their platform is fine. Your integration is the part that broke, and only a monitor pointed at your own path catches it. That is the entire reason to run external functional monitoring on a gateway you do not own.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I monitor the sandbox or production host?
&lt;/h3&gt;

&lt;p&gt;Monitor production, because that is the host your customers pay through. The trap is that PayPal, Square, and Adyen all use different base hosts for sandbox and production, and a monitor accidentally pointed at the sandbox host will report green while production is on fire. PayPal production is &lt;code&gt;api-m.paypal.com&lt;/code&gt; versus &lt;code&gt;api-m.sandbox.paypal.com&lt;/code&gt;. Square production is &lt;code&gt;connect.squareup.com&lt;/code&gt; versus &lt;code&gt;connect.squareupsandbox.com&lt;/code&gt;. Adyen production is your prefixed live host versus &lt;code&gt;checkout-test.adyen.com&lt;/code&gt;. Use low-privilege test credentials against production read paths, never a real admin key, and confirm the base host before you save the monitor.&lt;/p&gt;
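
&lt;p&gt;A cheap safeguard is to reject any monitor URL whose host is a known sandbox host before you save it. The sketch below is a hypothetical helper, not a Velprove feature, and it lists only the sandbox hosts named above. Adyen's live host is prefixed per account, so the check rejects Adyen's test host rather than trying to recognize every live one.&lt;/p&gt;

```python
from urllib.parse import urlparse

# Sandbox hosts named in the text. Adyen's live host is account-specific,
# so only its test host is listed.
SANDBOX_HOSTS = {
    "api-m.sandbox.paypal.com",
    "connect.squareupsandbox.com",
    "checkout-test.adyen.com",
}


def assert_production(url: str) -> str:
    """Return the URL's host, or raise if it is a known sandbox host."""
    host = (urlparse(url).hostname or "").lower()
    if host in SANDBOX_HOSTS:
        raise ValueError(f"{host} is a sandbox host, not production")
    return host
```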

</description>
      <category>monitoring</category>
      <category>webdev</category>
      <category>devops</category>
      <category>uptime</category>
    </item>
  </channel>
</rss>
