<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vantaj</title>
    <description>The latest articles on DEV Community by Vantaj (@vantaj_co).</description>
    <link>https://dev.to/vantaj_co</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3998909%2Fe3983466-0d83-4e12-877f-9cc6cb3e8055.png</url>
      <title>DEV Community: Vantaj</title>
      <link>https://dev.to/vantaj_co</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vantaj_co"/>
    <language>en</language>
    <item>
      <title>DNS Propagation Explained - Why Your Site Changes Take Hours</title>
      <dc:creator>Vantaj</dc:creator>
      <pubDate>Thu, 02 Jul 2026 14:29:35 +0000</pubDate>
      <link>https://dev.to/vantaj_co/dns-propagation-explained-why-your-site-changes-take-hours-1dke</link>
      <guid>https://dev.to/vantaj_co/dns-propagation-explained-why-your-site-changes-take-hours-1dke</guid>
      <description>&lt;h2&gt;
  
  
  You Changed the Record. Why Is Nothing Happening?
&lt;/h2&gt;

&lt;p&gt;You updated your A record to point to a new server. You triple-checked the value. Your DNS provider confirmed the change is saved. But when you visit your domain, it still loads the old site - and it's been 20 minutes.&lt;/p&gt;

&lt;p&gt;You're not doing anything wrong. This is DNS propagation: the time it takes for your DNS change to spread across the global network of DNS resolvers that translate domain names into IP addresses. It's one of the most misunderstood concepts in web infrastructure, and it causes more panic than it should.&lt;/p&gt;

&lt;p&gt;Here's what's actually happening, why it takes so long, and what you can do about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How DNS Resolution Works (The 30-Second Version)
&lt;/h2&gt;

&lt;p&gt;When someone types &lt;code&gt;yoursite.com&lt;/code&gt; into a browser, the request doesn't go directly to your server. It goes through a chain of DNS lookups:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Browser cache&lt;/strong&gt; - Has the browser resolved this domain recently? If yes, use the cached IP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS cache&lt;/strong&gt; - Has the operating system resolved it recently? If yes, use that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive resolver&lt;/strong&gt; - Your ISP or a public resolver (Google's 8.8.8.8, Cloudflare's 1.1.1.1) looks up the domain on your behalf.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root nameserver&lt;/strong&gt; - The recursive resolver asks a root server: "Who is responsible for .com domains?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLD nameserver&lt;/strong&gt; - The .com nameserver responds: "The nameservers for yoursite.com are ns1.your-dns-provider.com"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authoritative nameserver&lt;/strong&gt; - Your DNS provider's nameserver returns the actual A record: "yoursite.com → 203.0.113.42"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response cached&lt;/strong&gt; - The recursive resolver caches this answer for the duration of the TTL (Time to Live) and returns it to the browser.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Next time someone resolves &lt;code&gt;yoursite.com&lt;/code&gt;, steps 4–6 are skipped - the recursive resolver returns its cached answer. This is the caching layer that causes "propagation delay."&lt;/p&gt;

&lt;h2&gt;
  
  
  What DNS Propagation Actually Is
&lt;/h2&gt;

&lt;p&gt;DNS propagation isn't a broadcast. Your DNS provider doesn't push your new record to every resolver on the internet. Instead, it works by &lt;strong&gt;cache expiration&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You update your A record from &lt;code&gt;203.0.113.42&lt;/code&gt; to &lt;code&gt;198.51.100.7&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Your authoritative nameserver immediately serves the new value&lt;/li&gt;
&lt;li&gt;But every recursive resolver that recently looked up your domain still has the old value cached&lt;/li&gt;
&lt;li&gt;Those resolvers will continue serving the old value until their cached copy expires (based on TTL)&lt;/li&gt;
&lt;li&gt;After expiration, the next lookup fetches the new value from your authoritative nameserver&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;"Propagation" is really "waiting for caches around the world to expire." There's no propagation mechanism - just many independent caches, each expiring on its own schedule.&lt;/p&gt;
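&lt;p&gt;A minimal Python sketch of this behavior, assuming a toy &lt;code&gt;CachingResolver&lt;/code&gt; class and a dictionary standing in for the authoritative nameserver (both are illustrative, not a real DNS library). Time is passed in explicitly so the example is deterministic:&lt;/p&gt;

```python
class CachingResolver:
    """Toy recursive resolver that caches answers until their TTL expires."""

    def __init__(self, authoritative):
        # authoritative: dict of name -> (ip, ttl_seconds); stands in for the real nameserver
        self.authoritative = authoritative
        self.cache = {}  # name -> (ip, expires_at)

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached is not None and cached[1] > now:
            return cached[0]  # still fresh: the authoritative server is not consulted
        ip, ttl = self.authoritative[name]
        self.cache[name] = (ip, now + ttl)
        return ip


zone = {"yoursite.com": ("203.0.113.42", 3600)}
resolver = CachingResolver(zone)

first = resolver.resolve("yoursite.com", now=0)        # cache miss: fetches the old IP
zone["yoursite.com"] = ("198.51.100.7", 3600)          # you update the A record
during = resolver.resolve("yoursite.com", now=1800)    # cache still valid: old IP
after = resolver.resolve("yoursite.com", now=3600)     # cache expired: new IP
print(first, during, after)
```

&lt;p&gt;Nothing is ever pushed to the resolver in this sketch. The new value only shows up once the cached entry has expired and the next lookup goes back to the source.&lt;/p&gt;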

&lt;h2&gt;
  
  
  Why It Takes So Long
&lt;/h2&gt;

&lt;h3&gt;
  
  
  TTL (Time to Live)
&lt;/h3&gt;

&lt;p&gt;Every DNS record has a TTL value, measured in seconds. It tells recursive resolvers how long they're allowed to cache the record before checking for updates.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;TTL Value&lt;/th&gt;
&lt;th&gt;Cache Duration&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Records that change frequently, pre-migration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3600&lt;/td&gt;
&lt;td&gt;1 hour&lt;/td&gt;
&lt;td&gt;Standard TTL, good balance of performance and freshness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14400&lt;/td&gt;
&lt;td&gt;4 hours&lt;/td&gt;
&lt;td&gt;Stable records that rarely change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;86400&lt;/td&gt;
&lt;td&gt;24 hours&lt;/td&gt;
&lt;td&gt;Very stable records (MX, NS records)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If your TTL was 86400 (24 hours) when you made the change, a resolver that cached the old value just before the change can keep serving it for almost 24 more hours. This is where the familiar "up to 48 hours" estimate comes from: a 24-hour TTL, plus extra time for resolvers that hold records longer than the TTL allows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resolvers That Ignore TTL
&lt;/h3&gt;

&lt;p&gt;Some ISP resolvers enforce minimum cache times regardless of your TTL setting. If you set a TTL of 300 seconds (5 minutes) but a resolver enforces a minimum of 1 hour, your change won't be visible to users on that ISP for at least an hour.&lt;/p&gt;

&lt;p&gt;This isn't common with major public resolvers (Google, Cloudflare, Quad9) but does happen with smaller ISPs and corporate DNS infrastructure.&lt;/p&gt;
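&lt;p&gt;The effect of a resolver-side minimum can be sketched as a one-line rule. The function name and the &lt;code&gt;resolver_min_ttl&lt;/code&gt; parameter are hypothetical, and real resolvers may apply other policies (such as maximum TTLs) that this ignores:&lt;/p&gt;

```python
def effective_cache_seconds(record_ttl, resolver_min_ttl=0):
    """How long a resolver may keep serving a cached answer.

    resolver_min_ttl is a hypothetical floor that some resolvers enforce;
    0 means the resolver honors the record's TTL as published.
    """
    return max(record_ttl, resolver_min_ttl)


print(effective_cache_seconds(300))          # resolver honors the 5-minute TTL
print(effective_cache_seconds(300, 3600))    # the 1-hour floor wins
```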

&lt;h3&gt;
  
  
  Multiple Cache Layers
&lt;/h3&gt;

&lt;p&gt;The browser, operating system, and recursive resolver each maintain their own cache. Even after the recursive resolver gets the new value, a user's browser might still show the old site because of its local cache. This is why "it works on my phone but not my laptop" is a common DNS complaint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Negative Caching
&lt;/h3&gt;

&lt;p&gt;If someone looked up your domain before you created a record (getting an NXDOMAIN or empty response), that "doesn't exist" answer is also cached. How long is governed by your zone's SOA record, and is typically 15 minutes to an hour. New records can appear to not exist for some users because of negative caching.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Check Propagation Progress
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Using Online Propagation Checkers
&lt;/h3&gt;

&lt;p&gt;Web tools like whatsmydns.net and dnschecker.org query DNS resolvers in different geographic locations and show you what each one returns. From the command line, &lt;code&gt;dig&lt;/code&gt; lets you run the same comparison against any resolver you name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check what Google's resolver sees&lt;/span&gt;
dig @8.8.8.8 yoursite.com A

&lt;span class="c"&gt;# Check what Cloudflare's resolver sees&lt;/span&gt;
dig @1.1.1.1 yoursite.com A

&lt;span class="c"&gt;# Check the authoritative answer directly (bypasses all caches)&lt;/span&gt;
dig @ns1.your-dns-provider.com yoursite.com A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the authoritative nameserver returns the new value but public resolvers still return the old one, propagation is in progress. If the authoritative nameserver returns the old value, the change hasn't been applied correctly - check your DNS provider's dashboard.&lt;/p&gt;
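&lt;p&gt;That decision rule can be written down as a small Python helper. This is a sketch that assumes you have already collected the answers yourself (for example by running the &lt;code&gt;dig&lt;/code&gt; commands with &lt;code&gt;+short&lt;/code&gt;); the function name and return strings are made up for illustration:&lt;/p&gt;

```python
def propagation_state(authoritative_ip, resolver_ips, expected_ip):
    """Classify a DNS change from answers you have already collected.

    authoritative_ip: answer from your provider's nameserver
    resolver_ips: dict of resolver label -> answer from that resolver
    expected_ip: the new value you configured
    """
    if authoritative_ip != expected_ip:
        return "not applied"  # fix the record at the provider first
    stale = sorted(name for name, ip in resolver_ips.items() if ip != expected_ip)
    if stale:
        return "in progress, stale: " + ", ".join(stale)
    return "consistent"


answers = {"8.8.8.8": "198.51.100.7", "1.1.1.1": "203.0.113.42"}
state = propagation_state("198.51.100.7", answers, "198.51.100.7")
print(state)  # in progress, stale: 1.1.1.1
```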

&lt;h3&gt;
  
  
  What "Fully Propagated" Means
&lt;/h3&gt;

&lt;p&gt;There's no official moment when propagation is "complete." It's a gradual process where more and more resolvers worldwide get the updated value as their caches expire. In practice, most resolvers will have the new value within:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;5–15 minutes&lt;/strong&gt; if your previous TTL was 300 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1–4 hours&lt;/strong&gt; if your previous TTL was 3600 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12–48 hours&lt;/strong&gt; if your previous TTL was 86400 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: it's the &lt;strong&gt;previous&lt;/strong&gt; TTL that matters, not the new one you set. The cache expiration is based on the TTL that was served when the record was last cached.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Speed Up DNS Propagation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lower Your TTL Before Making Changes
&lt;/h3&gt;

&lt;p&gt;This is the single most effective strategy. If you know a DNS change is coming (migration, new server, new CDN), lower your TTL 24–48 hours in advance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;48 hours before:&lt;/strong&gt; Change TTL from 3600 to 300 (5 minutes)&lt;/li&gt;
&lt;li&gt;
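&lt;strong&gt;Wait 48 hours&lt;/strong&gt; for the old high-TTL cache entries to expire (one full old TTL, 1 hour in this example, is the minimum; the rest is safety margin for resolvers that hold records longer)&lt;/li&gt;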
&lt;li&gt;
&lt;strong&gt;Make your DNS change&lt;/strong&gt; - now resolvers will check back in 5 minutes instead of 1 hour&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After propagation:&lt;/strong&gt; Raise TTL back to 3600 for better performance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This reduces your propagation window from hours to minutes. It requires planning ahead, but it's the difference between a seamless migration and a 24-hour period of inconsistent behavior.&lt;/p&gt;
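&lt;p&gt;The timing can be worked out ahead of the migration. The sketch below assumes resolvers honor TTLs, so it computes the lower bound (one full old TTL after lowering it) rather than the conservative 48-hour wait in the steps above. &lt;code&gt;migration_plan&lt;/code&gt; and its field names are hypothetical:&lt;/p&gt;

```python
from datetime import datetime, timedelta


def migration_plan(ttl_lowered_at, old_ttl, low_ttl):
    """Return key timestamps for a planned DNS change (TTLs in seconds).

    The earliest safe change time is one full old TTL after lowering it;
    waiting longer only adds margin.
    """
    earliest_change = ttl_lowered_at + timedelta(seconds=old_ttl)
    # After the change, TTL-honoring resolvers pick up the new value within low_ttl.
    mostly_converged = earliest_change + timedelta(seconds=low_ttl)
    return {"earliest_change": earliest_change, "mostly_converged": mostly_converged}


plan = migration_plan(datetime(2026, 7, 1, 9, 0), old_ttl=3600, low_ttl=300)
print(plan["earliest_change"])     # 2026-07-01 10:00:00
print(plan["mostly_converged"])    # 2026-07-01 10:05:00
```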

&lt;h3&gt;
  
  
  Flush Your Local Cache
&lt;/h3&gt;

&lt;p&gt;While you can't flush every resolver's cache worldwide, you can flush your own:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;macOS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dscacheutil &lt;span class="nt"&gt;-flushcache&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;killall &lt;span class="nt"&gt;-HUP&lt;/span&gt; mDNSResponder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Windows:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ipconfig /flushdns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Chrome browser:&lt;/strong&gt;&lt;br&gt;
Navigate to &lt;code&gt;chrome://net-internals/#dns&lt;/code&gt; and click "Clear host cache"&lt;/p&gt;

&lt;p&gt;This only fixes it for you. It's useful for confirming what your own resolver now returns, but it doesn't help your users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use a DNS Provider with Fast Propagation
&lt;/h3&gt;

&lt;p&gt;Some DNS providers show faster propagation than others because they use lower default TTLs and push updates to all of their own authoritative nameservers quickly. Resolver caches elsewhere still expire on the TTL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare&lt;/strong&gt; - Typically propagates globally within 5 minutes for proxied records&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Route 53&lt;/strong&gt; - Propagation within minutes due to large nameserver network&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud DNS&lt;/strong&gt; - Fast propagation via Google's global infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Smaller or older DNS providers with fewer nameserver locations may take longer.&lt;/p&gt;

&lt;h2&gt;
  
  
  DNS Changes and Monitoring
&lt;/h2&gt;

&lt;p&gt;DNS changes are one of the most common causes of unexpected downtime. A misconfigured A record, a botched migration, or an unexpectedly long propagation window can make your site unreachable for some users while appearing fine for others.&lt;/p&gt;

&lt;h3&gt;
  
  
  How DNS Issues Affect Your Monitoring
&lt;/h3&gt;

&lt;p&gt;If your monitoring probe runs in US East and the resolver it uses still has the old record cached, your monitoring will show the old server as "up" while users in other regions see errors (or vice versa). This is why multi-region monitoring matters during DNS changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-region monitoring&lt;/strong&gt; might show your site as healthy from its location while it's broken for 30% of your users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-region monitoring&lt;/strong&gt; catches propagation inconsistencies immediately - if one probe region resolves to the new IP and another resolves to the old IP (which is now offline), you'll get an alert&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Using DNS Monitoring to Catch Problems
&lt;/h3&gt;

&lt;p&gt;Beyond uptime checks, dedicated DNS monitoring tracks your actual record values over time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A/AAAA record changes&lt;/strong&gt; - Get alerted if your domain starts resolving to an unexpected IP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NS record changes&lt;/strong&gt; - Detect unauthorized nameserver changes (domain hijacking)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MX record changes&lt;/strong&gt; - Catch mail routing issues before email delivery fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TXT record changes&lt;/strong&gt; - SPF, DKIM, DMARC modifications that affect email deliverability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vantaj monitors DNS records and alerts you when they change - whether you made the change intentionally or not. Combined with domain expiry monitoring and SSL certificate tracking, you get complete visibility into the infrastructure layer that sits between your users and your servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common DNS Propagation Mistakes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Changing DNS and Shutting Down the Old Server Immediately
&lt;/h3&gt;

&lt;p&gt;If your TTL was 3600 (1 hour), some users will still resolve to your old server IP for up to an hour after the change. If you've already shut that server down, those users get timeouts or connection refused errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Keep the old server running for at least 2x your previous TTL after making the DNS change. Only decommission it after propagation is complete.&lt;/p&gt;
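&lt;p&gt;As a quick worked example of the 2x rule, here is a small Python sketch. The function name and the &lt;code&gt;safety_factor&lt;/code&gt; parameter are illustrative:&lt;/p&gt;

```python
from datetime import datetime, timedelta


def earliest_decommission(change_made_at, previous_ttl_seconds, safety_factor=2):
    """Earliest time to shut down the old server: change time + factor x previous TTL."""
    return change_made_at + timedelta(seconds=previous_ttl_seconds * safety_factor)


shutdown_after = earliest_decommission(datetime(2026, 7, 1, 10, 0), 3600)
print(shutdown_after)  # 2026-07-01 12:00:00
```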

&lt;h3&gt;
  
  
  Testing Only from Your Own Machine
&lt;/h3&gt;

&lt;p&gt;"It works for me" doesn't mean it works for your users. Your local DNS cache might have the new value while ISP resolvers in other countries still serve the old one. Use propagation checkers or multi-region monitoring to verify globally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting TTL to 0
&lt;/h3&gt;

&lt;p&gt;A TTL of 0 means "don't cache this at all." In theory, every lookup should hit your authoritative nameserver. In practice, many resolvers enforce a minimum TTL of 30–300 seconds regardless of what you set. A TTL of 0 also dramatically increases load on your nameservers and can cause resolution delays.&lt;/p&gt;

&lt;h3&gt;
  
  
  Forgetting About Email
&lt;/h3&gt;

&lt;p&gt;When migrating a domain, teams often focus on web traffic (A records) and forget about email (MX records). If your MX records point to an old mail server that's been decommissioned, senders' mail servers will retry for a while and then bounce the message, and nothing on your side shows that it happened. Monitor your MX records alongside your A records during any migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;DNS propagation isn't instant because the internet is a distributed caching system. Every resolver caches your records independently, and they all expire on their own schedule. You can't push changes - you can only wait for caches to expire.&lt;/p&gt;

&lt;p&gt;The best strategy: plan ahead. Lower your TTL before making changes, keep old infrastructure running during propagation, monitor from multiple regions, and use DNS record monitoring to verify that changes are applied correctly and consistently worldwide.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>infrastructure</category>
      <category>networking</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How to Monitor an LLM API: What Uptime Tools Won't Tell You</title>
      <dc:creator>Vantaj</dc:creator>
      <pubDate>Thu, 02 Jul 2026 14:28:27 +0000</pubDate>
      <link>https://dev.to/vantaj_co/how-to-monitor-an-llm-api-what-uptime-tools-wont-tell-you-4bnl</link>
      <guid>https://dev.to/vantaj_co/how-to-monitor-an-llm-api-what-uptime-tools-wont-tell-you-4bnl</guid>
      <description>&lt;h2&gt;
  
  
  Your LLM Endpoint Returns 200. That Tells You Almost Nothing.
&lt;/h2&gt;

&lt;p&gt;Standard uptime monitoring checks whether a URL responds and whether it returns an expected status code. For a traditional API, that's a reasonable proxy for health.&lt;/p&gt;

&lt;p&gt;For an LLM endpoint, it's nearly useless.&lt;/p&gt;

&lt;p&gt;A 200 response from &lt;code&gt;/v1/chat/completions&lt;/code&gt; tells you the service is alive. It doesn't tell you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether the response came back in 2 seconds or 45 seconds&lt;/li&gt;
&lt;li&gt;Whether you're about to hit your daily token quota&lt;/li&gt;
&lt;li&gt;Whether you're being silently rate limited at the organization level&lt;/li&gt;
&lt;li&gt;Whether the model you requested is actually available or fell back to a different one&lt;/li&gt;
&lt;li&gt;Whether the response content is valid JSON, properly formatted, and non-empty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the failure modes that actually break user-facing AI features. And almost none of them show up in a standard HTTP monitor.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Ways LLM APIs Fail (That HTTP Monitoring Misses)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Latency Spikes
&lt;/h3&gt;

&lt;p&gt;LLM inference is not like a database query. Response time varies with input token count, output length, model size, infrastructure load, and geographic distance to the model provider's datacenters.&lt;/p&gt;

&lt;p&gt;A typical GPT-4o call might take 1.5 seconds under normal load. Under high load, or with a long output, it can take 30–60 seconds. Both return 200. Both look identical to a standard uptime monitor.&lt;/p&gt;

&lt;p&gt;From a user experience perspective, they are not identical.&lt;/p&gt;

&lt;p&gt;If your AI feature has an acceptable response time of 5 seconds and the model provider is regularly delivering in 15–20 seconds, your users are seeing a broken feature. Your uptime dashboard stays green.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you actually need to monitor:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;P50, P95, and P99 latency - not just average&lt;/li&gt;
&lt;li&gt;Time-to-first-token (TTFT) separately from total response time, especially for streaming endpoints&lt;/li&gt;
&lt;li&gt;Latency trends over time, not just point-in-time checks&lt;/li&gt;
&lt;li&gt;Latency by input token count, if your use case has variable prompt lengths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A health check that sends a fixed short prompt and measures total response time gives you a consistent baseline. If that baseline starts drifting - 2 seconds becomes 5 seconds, then 8 seconds - something upstream changed.&lt;/p&gt;
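&lt;p&gt;To see why percentiles matter more than the average, here is a minimal Python sketch using the nearest-rank method. The latency samples are invented for illustration; in practice they would come from your health check's history:&lt;/p&gt;

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical response times in seconds from a fixed-prompt health check
latencies = [1.4, 1.5, 1.6, 1.5, 1.7, 1.4, 1.6, 1.5, 2.1, 18.0]
average = sum(latencies) / len(latencies)
print(percentile(latencies, 50), percentile(latencies, 95), round(average, 2))
```

&lt;p&gt;With these samples the median is 1.5 seconds and P95 is 18 seconds, while the average lands near 3.2 seconds. The average hides the one request that a real user would have experienced as a failure.&lt;/p&gt;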

&lt;h3&gt;
  
  
  2. Rate Limits and 429 Errors
&lt;/h3&gt;

&lt;p&gt;Rate limiting from LLM providers is more complex than most APIs.&lt;/p&gt;

&lt;p&gt;Most providers enforce limits at multiple levels simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Requests per minute (RPM)&lt;/strong&gt; - total number of API calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokens per minute (TPM)&lt;/strong&gt; - total tokens (input + output) processed per minute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokens per day (TPD)&lt;/strong&gt; - daily token budget, especially on free tiers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organization-level limits&lt;/strong&gt; - separate from per-key limits, sometimes lower&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 429 response means one of these limits was hit. But which one? And is it a brief burst that will recover in 60 seconds, or a hard daily quota that resets at midnight?&lt;/p&gt;

&lt;p&gt;Standard monitoring treats all 4xx responses as errors. But a 429 is a different kind of error than a 404 or a 401. It's temporary, self-resolving, and requires different handling in your application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you actually need to monitor:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track 429 response rates separately from other error rates&lt;/li&gt;
&lt;li&gt;Alert when 429 rate exceeds a threshold - not on first occurrence&lt;/li&gt;
&lt;li&gt;Monitor token consumption trends if the provider exposes usage headers (&lt;code&gt;x-ratelimit-remaining-tokens&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Set up a heartbeat that runs a minimal test prompt on a schedule to validate quota is healthy before peak usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your application doesn't have alerting specifically for quota exhaustion, you'll find out when users start getting errors - not before.&lt;/p&gt;
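&lt;p&gt;One way to alert on a 429 rate rather than on the first occurrence is a sliding window over recent status codes. This is a sketch only: &lt;code&gt;RateLimitTracker&lt;/code&gt;, the window size, and the threshold are all assumptions to tune for your own traffic:&lt;/p&gt;

```python
from collections import deque


class RateLimitTracker:
    """Tracks the share of 429 responses over the most recent N requests."""

    def __init__(self, window=100, alert_threshold=0.05):
        self.statuses = deque(maxlen=window)   # oldest entries fall off automatically
        self.alert_threshold = alert_threshold

    def record(self, status_code):
        self.statuses.append(status_code)

    def rate_429(self):
        if not self.statuses:
            return 0.0
        return sum(1 for s in self.statuses if s == 429) / len(self.statuses)

    def should_alert(self):
        return self.rate_429() > self.alert_threshold


tracker = RateLimitTracker(window=20, alert_threshold=0.10)
for code in [200] * 19 + [429]:
    tracker.record(code)
single_429 = (tracker.rate_429(), tracker.should_alert())   # one 429 in 20: no alert

for code in [429, 429, 429]:
    tracker.record(code)
burst_429 = (tracker.rate_429(), tracker.should_alert())    # four 429s in 20: alert
print(single_429, burst_429)
```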

&lt;h3&gt;
  
  
  3. Cold Starts
&lt;/h3&gt;

&lt;p&gt;Several LLM providers and inference platforms spin down compute when idle and restart on demand. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-hosted models on auto-scaling infrastructure&lt;/li&gt;
&lt;li&gt;Smaller model providers and inference startups&lt;/li&gt;
&lt;li&gt;Fine-tuned models deployed on serverless GPU platforms (Modal, Replicate, Runpod)&lt;/li&gt;
&lt;li&gt;Open-source model deployments on spot infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cold start latency can range from a few seconds to over a minute, depending on model size and platform. During a cold start, the API typically returns 200 - it just takes much longer than usual.&lt;/p&gt;

&lt;p&gt;For user-facing features, a 45-second cold start is functionally a timeout. Users close the tab, report the feature as broken, or abandon the flow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you actually need to monitor:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track time-to-first-response, not just whether a response arrived&lt;/li&gt;
&lt;li&gt;Alert when response time exceeds a threshold that indicates a cold start (e.g., &amp;gt;10 seconds for a short prompt)&lt;/li&gt;
&lt;li&gt;For self-hosted deployments: monitor whether GPU workers are warm using a keep-alive heartbeat that fires every few minutes&lt;/li&gt;
&lt;li&gt;Consider a scheduled warm-up request that runs before peak usage hours&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Degraded or Wrong Responses
&lt;/h3&gt;

&lt;p&gt;This one is the hardest to monitor but often the most impactful.&lt;/p&gt;

&lt;p&gt;An LLM can return:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An empty &lt;code&gt;choices&lt;/code&gt; array with a 200 status&lt;/li&gt;
&lt;li&gt;A response with &lt;code&gt;finish_reason: "length"&lt;/code&gt; indicating the output was cut off&lt;/li&gt;
&lt;li&gt;A malformed JSON response that breaks downstream parsing&lt;/li&gt;
&lt;li&gt;A refusal or safety filter response that doesn't match the expected output format&lt;/li&gt;
&lt;li&gt;A response from the wrong model version if the requested model was unavailable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are 5xx errors. None are 4xx errors. They all return 200. And they all break downstream behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you actually need to monitor:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validate that &lt;code&gt;choices[0].message.content&lt;/code&gt; is non-empty&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;finish_reason&lt;/code&gt; - &lt;code&gt;"stop"&lt;/code&gt; is expected; &lt;code&gt;"length"&lt;/code&gt; or &lt;code&gt;"content_filter"&lt;/code&gt; may indicate problems&lt;/li&gt;
&lt;li&gt;Validate that output matches expected structure (especially for JSON mode or tool-calling responses)&lt;/li&gt;
&lt;li&gt;Alert on elevated rates of truncated responses, which usually mean the output hit the request's token limit or the model's context window before the model finished&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This kind of monitoring is closer to synthetic testing than uptime monitoring. You're not just checking if the endpoint is alive - you're checking if it's producing useful output.&lt;/p&gt;
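&lt;p&gt;A sketch of such a check in Python, assuming a response shaped like the chat completions format referenced above (&lt;code&gt;choices&lt;/code&gt;, &lt;code&gt;message.content&lt;/code&gt;, &lt;code&gt;finish_reason&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;). The function name, the problem strings, and the sample payload are hypothetical, and it targets plain-text replies rather than tool calls:&lt;/p&gt;

```python
def validate_completion(payload, expected_model_prefix=None):
    """Return a list of problems found in a chat-completions style response dict."""
    problems = []
    choices = payload.get("choices") or []
    if not choices:
        return ["empty choices array"]
    first = choices[0]
    content = (first.get("message") or {}).get("content")
    if not content or not content.strip():
        problems.append("empty content")
    finish_reason = first.get("finish_reason")
    if finish_reason != "stop":
        problems.append("finish_reason=" + str(finish_reason))
    model = payload.get("model") or ""
    if expected_model_prefix and not model.startswith(expected_model_prefix):
        problems.append("unexpected model: " + model)
    return problems


# Hypothetical payload: a 200 response whose output was cut off
truncated = {
    "model": "gpt-4o-2024-08-06",
    "choices": [{"message": {"content": "The answer is"}, "finish_reason": "length"}],
}
found = validate_completion(truncated, expected_model_prefix="gpt-4o")
print(found)  # ['finish_reason=length']
```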

&lt;h2&gt;
  
  
  What LLM API Monitoring Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;Here's a practical setup for monitoring a production LLM feature:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Basic Availability (HTTP Monitor)
&lt;/h3&gt;

&lt;p&gt;Use a standard HTTP monitor to check that the endpoint responds at all. Set it up with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A short, fixed test prompt (e.g., &lt;code&gt;"Reply with 'OK' and nothing else"&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;An expected response body check for &lt;code&gt;"OK"&lt;/code&gt; or the string you expect&lt;/li&gt;
&lt;li&gt;A timeout of 15–20 seconds (longer than for a typical API, to allow for variable inference time)&lt;/li&gt;
&lt;li&gt;Alerts on 5xx responses and on timeouts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This catches the basic cases: service is completely down, returning errors, or unresponsive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Latency Baseline (Response Time Monitoring)
&lt;/h3&gt;

&lt;p&gt;Configure your monitor to track response time trends and alert when they deviate significantly from baseline. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alert if the median (or P95) response time for your test prompt exceeds 2–3x the historical baseline&lt;/li&gt;
&lt;li&gt;Track this metric weekly - gradual drift often signals infrastructure changes upstream&lt;/li&gt;
&lt;li&gt;For streaming endpoints, measure time to first byte separately&lt;/li&gt;
&lt;/ul&gt;
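&lt;p&gt;The baseline comparison can be sketched in a few lines of Python. This version uses the median as a robust stand-in for "typical" response time; the function name, the factor, and the sample values are assumptions:&lt;/p&gt;

```python
from statistics import median


def latency_drifted(baseline_samples, recent_samples, factor=2.0):
    """True when the recent median exceeds factor x the baseline median."""
    return median(recent_samples) > factor * median(baseline_samples)


baseline = [1.8, 2.0, 2.1, 1.9, 2.0]     # hypothetical seconds, historical
recent = [4.6, 5.1, 4.9, 5.3, 4.8]       # hypothetical seconds, this week
drifted = latency_drifted(baseline, recent)
print(drifted)  # True
```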

&lt;h3&gt;
  
  
  Layer 3: Error Rate Tracking (Keyword + Status Monitoring)
&lt;/h3&gt;

&lt;p&gt;Run a scheduled monitor that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checks for 429 response codes separately from other 4xx/5xx errors&lt;/li&gt;
&lt;li&gt;Validates that the response body contains expected fields (&lt;code&gt;choices&lt;/code&gt;, &lt;code&gt;usage&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Checks that &lt;code&gt;usage.total_tokens&lt;/code&gt; is non-zero (a zero token count usually indicates a malformed request or empty response)&lt;/li&gt;
&lt;li&gt;Alerts if &lt;code&gt;finish_reason&lt;/code&gt; in the response is &lt;code&gt;"content_filter"&lt;/code&gt; or &lt;code&gt;"length"&lt;/code&gt; more than occasionally&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Layer 4: Quota Health (Heartbeat / Scheduled Check)
&lt;/h3&gt;

&lt;p&gt;For providers that expose quota information in response headers or via a separate &lt;code&gt;/usage&lt;/code&gt; endpoint:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up a daily check that queries current token usage vs. limits&lt;/li&gt;
&lt;li&gt;Run this before your peak usage window - not after you've already hit the limit&lt;/li&gt;
&lt;li&gt;Treat quota at &amp;gt;80% utilization as a warning, not a critical alert&lt;/li&gt;
&lt;/ul&gt;
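&lt;p&gt;A sketch of the 80% rule in Python, assuming the provider returns both a limit and a remaining-tokens header. The header names below follow the &lt;code&gt;x-ratelimit-remaining-tokens&lt;/code&gt; pattern mentioned earlier, but names and the window they describe vary by provider, so treat them as assumptions:&lt;/p&gt;

```python
def quota_status(headers, warn_at=0.80):
    """Classify token quota usage from rate-limit response headers.

    Assumes the provider sends x-ratelimit-limit-tokens and
    x-ratelimit-remaining-tokens; header names vary by provider.
    """
    limit = int(headers["x-ratelimit-limit-tokens"])
    remaining = int(headers["x-ratelimit-remaining-tokens"])
    used_fraction = (limit - remaining) / limit
    if remaining == 0:
        return "exhausted", used_fraction
    if used_fraction > warn_at:
        return "warning", used_fraction
    return "ok", used_fraction


# Hypothetical header values
status, used = quota_status({
    "x-ratelimit-limit-tokens": "200000",
    "x-ratelimit-remaining-tokens": "30000",
})
print(status, used)
```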

&lt;h3&gt;
  
  
  Layer 5: Dependency Status (External Monitor)
&lt;/h3&gt;

&lt;p&gt;Monitor your AI provider's status page directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI: &lt;code&gt;https://status.openai.com/api/v2/status.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Anthropic: &lt;code&gt;https://status.anthropic.com/api/v2/status.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Most providers expose a machine-readable status endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Set up an HTTP monitor on this endpoint and alert when status changes from &lt;code&gt;"All Systems Operational"&lt;/code&gt;. This gives you advance warning of provider-side degradation before it fully impacts your users - and helps you quickly determine whether an incident is on your side or theirs.&lt;/p&gt;
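&lt;p&gt;A sketch of parsing such a status document in Python. It assumes a Statuspage-style body with a &lt;code&gt;status&lt;/code&gt; object containing &lt;code&gt;indicator&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; fields; check your provider's actual response before relying on this shape. The sample body is invented:&lt;/p&gt;

```python
import json


def provider_is_healthy(status_json_text):
    """Parse a Statuspage-style status.json body and report whether it is all clear.

    Assumes the shape {"status": {"indicator": ..., "description": ...}},
    where indicator "none" means no active incident.
    """
    status = json.loads(status_json_text).get("status", {})
    return status.get("indicator") == "none", status.get("description", "")


# Hypothetical response body
sample = '{"status": {"indicator": "minor", "description": "Partially Degraded Service"}}'
healthy, description = provider_is_healthy(sample)
print(healthy, description)
```

&lt;p&gt;Checking a machine-readable field instead of matching the human-readable text keeps the monitor working if the provider rewords its status messages.&lt;/p&gt;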

&lt;h2&gt;
  
  
  The Provider-Side Outage Problem
&lt;/h2&gt;

&lt;p&gt;One of the hardest monitoring challenges for AI-powered applications is distinguishing between your infrastructure failing and your AI provider failing.&lt;/p&gt;

&lt;p&gt;Standard monitoring can't tell the difference. Both show up as elevated error rates or latency spikes in your application metrics.&lt;/p&gt;

&lt;p&gt;You need two separate monitoring layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Your application endpoint&lt;/strong&gt; - monitors whether your service is responding correctly end-to-end&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The provider's API directly&lt;/strong&gt; - monitors whether OpenAI, Anthropic, or whoever you depend on is healthy&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When both show problems simultaneously, it's almost certainly the provider. When only your application shows problems, it's almost certainly you.&lt;/p&gt;

&lt;p&gt;Without both layers, you'll spend time debugging your infrastructure during provider outages, and miss application-side regressions when the provider is healthy.&lt;/p&gt;
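&lt;p&gt;The triage logic from the two layers fits in a small lookup. This Python sketch is a first guess about where to look, not a diagnosis, and the function name and labels are illustrative:&lt;/p&gt;

```python
def likely_culprit(app_healthy, provider_healthy):
    """Map the two monitoring layers to a first guess about where to look."""
    if app_healthy and provider_healthy:
        return "no incident"
    if not app_healthy and not provider_healthy:
        return "provider (most likely)"
    if not app_healthy:
        return "your application (most likely)"
    return "provider issue not yet affecting your application"


guess = likely_culprit(app_healthy=False, provider_healthy=False)
print(guess)  # provider (most likely)
```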

&lt;h2&gt;
  
  
  Quick Reference: LLM API Failure Modes
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Mode&lt;/th&gt;
&lt;th&gt;Status Code&lt;/th&gt;
&lt;th&gt;Caught by HTTP Monitor?&lt;/th&gt;
&lt;th&gt;What to Actually Check&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Service completely down&lt;/td&gt;
&lt;td&gt;503 / no response&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;Standard HTTP check&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limit hit&lt;/td&gt;
&lt;td&gt;429&lt;/td&gt;
&lt;td&gt;⚠️ Only if you check for it&lt;/td&gt;
&lt;td&gt;Track 429 rate separately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency spike / cold start&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;Response time threshold alert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quota exhaustion (soft)&lt;/td&gt;
&lt;td&gt;429&lt;/td&gt;
&lt;td&gt;⚠️ Only if you check for it&lt;/td&gt;
&lt;td&gt;Token usage headers / /usage endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Empty or truncated output&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;Validate &lt;code&gt;choices[0].message.content&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wrong model version&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;model&lt;/code&gt; field in response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output cut off&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;finish_reason != "length"&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider degradation&lt;/td&gt;
&lt;td&gt;200 (slow)&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;Monitor provider status page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth token expired&lt;/td&gt;
&lt;td&gt;401&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;Standard HTTP check&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Monitoring Gap Is Getting Larger
&lt;/h2&gt;

&lt;p&gt;As more production systems depend on LLM APIs, the gap between "standard uptime monitoring" and "meaningful AI infrastructure monitoring" is growing.&lt;/p&gt;

&lt;p&gt;A traditional API either works or it doesn't. Response time variance is usually small and predictable. Error modes are well-understood and well-documented.&lt;/p&gt;

&lt;p&gt;LLM APIs are different in almost every dimension. They're probabilistic, slow, expensive per call, and fail in ways that look like success to naive monitoring.&lt;/p&gt;

&lt;p&gt;Getting ahead of this means treating LLM API monitoring as its own discipline - not as an afterthought on top of your existing HTTP checks.&lt;/p&gt;

&lt;p&gt;Your users will notice the difference before your monitoring does, unless you build the right checks first.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>llm</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>HTTP Status Codes: Complete Reference Guide (2026)</title>
      <dc:creator>Vantaj</dc:creator>
      <pubDate>Thu, 02 Jul 2026 14:27:32 +0000</pubDate>
      <link>https://dev.to/vantaj_co/http-status-codes-complete-reference-guide-2026-1h2</link>
      <guid>https://dev.to/vantaj_co/http-status-codes-complete-reference-guide-2026-1h2</guid>
      <description>&lt;p&gt;HTTP status codes are three-digit numbers a server sends back with every response. The first digit tells you the class of response. The next two digits narrow it down.&lt;/p&gt;

&lt;p&gt;This guide covers every meaningful status code - what it means, when you'll encounter it, what to do when your monitoring catches it, and which ones matter most for reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Status Code Classes
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Class&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1xx&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100–199&lt;/td&gt;
&lt;td&gt;Informational - request received, processing continues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2xx&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200–299&lt;/td&gt;
&lt;td&gt;Success - request received, understood, and accepted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3xx&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;300–399&lt;/td&gt;
&lt;td&gt;Redirection - further action needed to complete request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4xx&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;400–499&lt;/td&gt;
&lt;td&gt;Client error - request contains bad syntax or can't be fulfilled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5xx&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;500–599&lt;/td&gt;
&lt;td&gt;Server error - server failed to fulfill a valid request&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The dividing line at 4xx vs. 5xx matters for monitoring: a 4xx means the client did something wrong; a 5xx means the server failed. When your uptime monitor fires on a 4xx, check your monitor configuration. When it fires on a 5xx, check your infrastructure.&lt;/p&gt;
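
&lt;p&gt;As a rough illustration of that rule, the class is just the first digit of the code. The function names below are made up for this sketch.&lt;/p&gt;

```python
# Map the first digit of a status code to its class name.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Return the class name for an integer HTTP status code."""
    if code not in range(100, 600):
        raise ValueError(f"not an HTTP status code: {code}")
    return CLASSES[code // 100]


def first_place_to_look(code):
    """Apply the 4xx-vs-5xx rule of thumb: monitor config or infrastructure."""
    cls = status_class(code)
    if cls == "client error":
        return "monitor configuration"
    if cls == "server error":
        return "infrastructure"
    return "nothing to investigate"
```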




&lt;h2&gt;
  
  
  1xx - Informational
&lt;/h2&gt;

&lt;p&gt;These codes acknowledge that the request is in progress. You rarely see them directly in day-to-day HTTP traffic, but they appear in &lt;code&gt;Expect: 100-continue&lt;/code&gt; uploads, WebSocket upgrade handshakes (101), and Early Hints responses (103).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;100&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Continue&lt;/td&gt;
&lt;td&gt;The server received the request headers and the client should proceed with sending the request body. Used when the client sends &lt;code&gt;Expect: 100-continue&lt;/code&gt; before a large upload.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;101&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Switching Protocols&lt;/td&gt;
&lt;td&gt;The server agrees to upgrade the connection protocol. Most commonly seen in WebSocket upgrades (&lt;code&gt;Upgrade: websocket&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;102&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Processing&lt;/td&gt;
&lt;td&gt;The server received the request and is processing it, but hasn't finished. Prevents the client from timing out during long operations. (WebDAV)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;103&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Early Hints&lt;/td&gt;
&lt;td&gt;The server sends preliminary response headers (e.g., a &lt;code&gt;Link&lt;/code&gt; header with &lt;code&gt;rel=preload&lt;/code&gt;) before the final response. Allows browsers to start preloading assets early.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Monitoring relevance:&lt;/strong&gt; 101 appears in WebSocket health checks. 103 is a CDN optimization feature. You won't monitor against 1xx codes in standard uptime monitoring.&lt;/p&gt;




&lt;h2&gt;
  
  
  2xx - Success
&lt;/h2&gt;

&lt;p&gt;The request was received, understood, and processed. The specific 2xx code tells you &lt;em&gt;how&lt;/em&gt; it was processed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;200&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;Standard success. The response body contains the requested data.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;201&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Created&lt;/td&gt;
&lt;td&gt;A new resource was created. Typically returned after a successful POST. The &lt;code&gt;Location&lt;/code&gt; header usually points to the new resource.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;202&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Accepted&lt;/td&gt;
&lt;td&gt;The request was accepted for processing, but processing hasn't completed. Used for async operations where the server queues work.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;203&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Non-Authoritative Information&lt;/td&gt;
&lt;td&gt;The request succeeded, but a transforming proxy modified the origin server's response on the way through. The headers or body may differ from what the origin sent.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;204&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No Content&lt;/td&gt;
&lt;td&gt;The request succeeded but there's nothing to return. Common in DELETE operations, OPTIONS preflight responses, and PATCH calls where no body is needed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;205&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reset Content&lt;/td&gt;
&lt;td&gt;Success, and the client should reset the document view (e.g., clear a form). Rarely used in practice.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;206&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Partial Content&lt;/td&gt;
&lt;td&gt;The server is delivering only part of the resource. Used for range requests - resumable downloads, video streaming, large file chunking.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;207&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-Status&lt;/td&gt;
&lt;td&gt;The response body contains multiple status codes for multiple sub-requests. (WebDAV)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;208&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Already Reported&lt;/td&gt;
&lt;td&gt;Resources have already been listed in a previous response. Prevents infinite loops in DAV tree traversal. (WebDAV)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;226&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IM Used&lt;/td&gt;
&lt;td&gt;The server fulfilled a GET request using delta encoding. (HTTP Delta Encoding, RFC 3229)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2xx codes you'll encounter most
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;200 OK&lt;/strong&gt; - By far the most common success response. Configure your monitors to expect 200 from health check endpoints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;201 Created&lt;/strong&gt; - Verify your API returns this after POST requests that create resources. If your API returns 200 on creation instead of 201, it works but doesn't follow REST conventions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;204 No Content&lt;/strong&gt; - Common from DELETE endpoints and webhooks. If your uptime monitor checks a DELETE endpoint and expects a body, 204 will look like a failure. Configure body checks carefully on these endpoints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;206 Partial Content&lt;/strong&gt; - Relevant when monitoring media streaming endpoints. A 206 on a streaming endpoint is healthy behavior, not a failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring tip:&lt;/strong&gt; A 200 response doesn't always mean healthy. Load balancers can return 200 with an error page in the body. CDNs can return 200 with stale cached content. Configure your monitor to also validate a keyword in the response body (e.g., &lt;code&gt;"status":"ok"&lt;/code&gt;) to catch these cases.&lt;/p&gt;
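
&lt;p&gt;A minimal sketch of that keyword check, using only the Python standard library. The &lt;code&gt;probe&lt;/code&gt; and &lt;code&gt;is_healthy&lt;/code&gt; names are hypothetical, and the keyword is matched as an exact substring, so a body formatted as &lt;code&gt;"status": "ok"&lt;/code&gt; (with a space) would need a different keyword or real JSON parsing.&lt;/p&gt;

```python
import urllib.error
import urllib.request


def is_healthy(status, body, keyword='"status":"ok"'):
    """Healthy only if the status is 200 AND the body contains the keyword."""
    return status == 200 and keyword in body


def probe(url, timeout=10):
    """Fetch a URL and return (status, body_text), including for 4xx/5xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the error object still carries the body.
        return err.code, err.read().decode("utf-8", errors="replace")
```

&lt;p&gt;Usage: &lt;code&gt;is_healthy(*probe(url))&lt;/code&gt;, where &lt;code&gt;url&lt;/code&gt; is your own health check endpoint. A load balancer error page served with a 200 fails the check because the keyword is missing.&lt;/p&gt;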




&lt;h2&gt;
  
  
  3xx - Redirection
&lt;/h2&gt;

&lt;p&gt;The client needs to take additional action to complete the request, usually by following a redirect.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;300&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple Choices&lt;/td&gt;
&lt;td&gt;The resource has multiple representations. The server provides options - the client chooses. Rarely used in practice.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;301&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moved Permanently&lt;/td&gt;
&lt;td&gt;The resource has a new permanent URL. Clients and crawlers should update their references. Cached by browsers and proxies.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;302&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Found&lt;/td&gt;
&lt;td&gt;Temporary redirect. The resource is temporarily at a different URL. Clients should continue to use the original URL for future requests.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;303&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;See Other&lt;/td&gt;
&lt;td&gt;Redirect to a different URL, and use GET to retrieve it. Used after POST/PUT to redirect to a confirmation page (Post/Redirect/Get pattern).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;304&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not Modified&lt;/td&gt;
&lt;td&gt;The resource hasn't changed since the client's cached version. No body is returned - the client uses its cache. Requires &lt;code&gt;If-Modified-Since&lt;/code&gt; or &lt;code&gt;If-None-Match&lt;/code&gt; in the request.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;307&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Temporary Redirect&lt;/td&gt;
&lt;td&gt;Redirect, but the method and body must be preserved. Unlike 302, a POST stays a POST after the redirect.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;308&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Permanent Redirect&lt;/td&gt;
&lt;td&gt;Like 301, but the method and body must be preserved. A POST to a 308 URL stays a POST at the new URL.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3xx codes you'll encounter most
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;301 Moved Permanently&lt;/strong&gt; - HTTP → HTTPS redirects, domain migrations, URL restructuring. Your monitoring tool should follow redirects by default. If it doesn't, a site that redirects HTTP to HTTPS will always trigger an alert.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;302 Found&lt;/strong&gt; - Temporary redirects. Common in login flows, A/B testing, and temporary maintenance pages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;304 Not Modified&lt;/strong&gt; - Normal caching behavior. If your uptime monitor sends conditional requests and gets 304, it's a valid healthy response - configure your monitor to accept it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;307 vs. 302&lt;/strong&gt; - If you're running a redirect after a POST (e.g., redirect after form submission), 307 preserves the POST method while 302 doesn't guarantee it. Modern clients treat 302 as a GET redirect in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring tip:&lt;/strong&gt; If your monitor detects a redirect chain longer than 3-4 hops, that's a misconfiguration worth investigating. Excessive redirect chains add latency and can cause loops.&lt;/p&gt;
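
&lt;p&gt;One way to sketch that chain check is shown below. It assumes a caller-supplied &lt;code&gt;fetch&lt;/code&gt; function (hypothetical) that requests a URL without following redirects and returns the status plus the &lt;code&gt;Location&lt;/code&gt; value, and it assumes absolute URLs in &lt;code&gt;Location&lt;/code&gt;.&lt;/p&gt;

```python
REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(url, fetch, max_hops=4):
    """Follow a redirect chain and return (final_url, hops).

    fetch(url) must return (status, location_or_None) without following
    redirects itself. Raises RuntimeError on a loop or an overlong chain.
    """
    seen = [url]
    while True:
        status, location = fetch(seen[-1])
        if status not in REDIRECT_CODES or location is None:
            # Not a redirect: this is the final destination.
            return seen[-1], len(seen) - 1
        if location in seen:
            raise RuntimeError("redirect loop: " + " -> ".join(seen + [location]))
        if len(seen) > max_hops:
            # Following this redirect would exceed max_hops.
            raise RuntimeError(f"redirect chain longer than {max_hops} hops")
        seen.append(location)
```

&lt;p&gt;Injecting &lt;code&gt;fetch&lt;/code&gt; keeps the chain logic testable without network access. In production it would wrap whatever HTTP client your monitor already uses.&lt;/p&gt;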




&lt;h2&gt;
  
  
  4xx - Client Errors
&lt;/h2&gt;

&lt;p&gt;The server received the request but couldn't process it because of a problem with the request itself. The client - browser, API consumer, or monitoring probe - sent something invalid.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;400&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bad Request&lt;/td&gt;
&lt;td&gt;The server can't process the request due to malformed syntax, invalid parameters, or deceptive routing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;401&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unauthorized&lt;/td&gt;
&lt;td&gt;Authentication required. The client hasn't provided credentials or provided invalid ones. The &lt;code&gt;WWW-Authenticate&lt;/code&gt; header tells the client what authentication scheme to use.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;402&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Payment Required&lt;/td&gt;
&lt;td&gt;Reserved for future use, originally intended for digital payments. Some APIs use it for rate-limiting behind paywalls.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;403&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Forbidden&lt;/td&gt;
&lt;td&gt;The server understands the request but refuses to authorize it. Typically the client is authenticated but lacks permission. Unlike 401, re-authenticating with the same credentials won't help.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;404&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not Found&lt;/td&gt;
&lt;td&gt;The resource doesn't exist at this URL. May be permanent or temporary. The server isn't saying whether it ever existed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;405&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Method Not Allowed&lt;/td&gt;
&lt;td&gt;The HTTP method used isn't supported for this resource, for example a &lt;code&gt;GET&lt;/code&gt; request to an endpoint that only accepts &lt;code&gt;POST&lt;/code&gt;. The response includes an &lt;code&gt;Allow&lt;/code&gt; header listing valid methods.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;406&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not Acceptable&lt;/td&gt;
&lt;td&gt;The server can't produce a response matching the client's &lt;code&gt;Accept&lt;/code&gt; headers, such as the content type or language the client requested.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;407&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Proxy Authentication Required&lt;/td&gt;
&lt;td&gt;Like 401, but the proxy (not the origin server) requires authentication.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;408&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Request Timeout&lt;/td&gt;
&lt;td&gt;The client took too long to send the full request. The server closed the connection.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;409&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conflict&lt;/td&gt;
&lt;td&gt;The request conflicts with the current state of the resource. Common in concurrent update scenarios - two clients trying to modify the same resource simultaneously.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;410&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gone&lt;/td&gt;
&lt;td&gt;The resource existed but was permanently removed. Unlike 404, the server explicitly confirms it's gone forever.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;411&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Length Required&lt;/td&gt;
&lt;td&gt;The server requires a &lt;code&gt;Content-Length&lt;/code&gt; header but the request didn't include one.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;412&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Precondition Failed&lt;/td&gt;
&lt;td&gt;Conditional request headers (&lt;code&gt;If-Match&lt;/code&gt;, &lt;code&gt;If-Unmodified-Since&lt;/code&gt;) didn't match the resource's current state.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;413&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Content Too Large&lt;/td&gt;
&lt;td&gt;The request body exceeds the server's allowed size. Common when uploading files that exceed configured limits.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;414&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;URI Too Long&lt;/td&gt;
&lt;td&gt;The request URI is longer than the server will process. Usually caused by extremely long query strings.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;415&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unsupported Media Type&lt;/td&gt;
&lt;td&gt;The server won't accept the request because the &lt;code&gt;Content-Type&lt;/code&gt; doesn't match what it expects, for example sending XML to an endpoint that only accepts JSON.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;416&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Range Not Satisfiable&lt;/td&gt;
&lt;td&gt;The range in a range request (&lt;code&gt;Range: bytes=500-999&lt;/code&gt;) doesn't overlap with the actual resource.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;417&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Expectation Failed&lt;/td&gt;
&lt;td&gt;The server can't meet the requirements specified in the &lt;code&gt;Expect&lt;/code&gt; request header.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;418&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;I'm a Teapot&lt;/td&gt;
&lt;td&gt;An April Fools' joke from RFC 2324 (1998). A teapot refuses to brew coffee. Some APIs use it as a custom error code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;421&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Misdirected Request&lt;/td&gt;
&lt;td&gt;The request was directed at a server that can't produce a response. Common in misconfigured TLS/SNI setups.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;422&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unprocessable Content&lt;/td&gt;
&lt;td&gt;The request is well-formed but contains semantic errors. Common in REST APIs: the JSON is valid, but the values are logically invalid.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;423&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Locked&lt;/td&gt;
&lt;td&gt;The resource is locked. (WebDAV)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;424&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Failed Dependency&lt;/td&gt;
&lt;td&gt;A previous request in a batch failed, causing this one to fail. (WebDAV)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;425&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Too Early&lt;/td&gt;
&lt;td&gt;The server won't process the request because it arrived as TLS 1.3 early data and could be replayed by an attacker.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;426&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Upgrade Required&lt;/td&gt;
&lt;td&gt;The client must switch to a different protocol (specified in &lt;code&gt;Upgrade&lt;/code&gt; header) to use this endpoint.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;428&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Precondition Required&lt;/td&gt;
&lt;td&gt;The server requires conditional request headers (&lt;code&gt;If-Match&lt;/code&gt;) to prevent lost updates - but the client didn't send them.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Too Many Requests&lt;/td&gt;
&lt;td&gt;The client has sent too many requests in a given time window. The response usually includes &lt;code&gt;Retry-After&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;431&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Request Header Fields Too Large&lt;/td&gt;
&lt;td&gt;The request headers are too large for the server to process.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;451&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unavailable For Legal Reasons&lt;/td&gt;
&lt;td&gt;The resource is unavailable due to legal demands - copyright, court orders, government censorship. Named after Fahrenheit 451.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  4xx codes you'll encounter most in monitoring
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;400 Bad Request&lt;/strong&gt; - If your monitor hits a 400, check the request configuration. Most often, the endpoint changed its expected parameters and your monitor's request is now malformed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;401 Unauthorized&lt;/strong&gt; - Your monitor is hitting an authenticated endpoint without credentials, or credentials expired. Update the monitor's authentication configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;403 Forbidden&lt;/strong&gt; - The server actively refuses the request. Common causes: IP allowlist that doesn't include your monitoring probe IPs, rate limiting, or a security policy change. Check if your monitoring provider's IP ranges are allowlisted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;404 Not Found&lt;/strong&gt; - The monitored URL was deleted, renamed, or never existed. Verify the URL is correct. Don't monitor staging endpoints that get deleted between deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;429 Too Many Requests&lt;/strong&gt; - Your monitoring probe is hitting a rate limit. Increase check intervals or exempt your monitoring IPs from rate limiting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring tip:&lt;/strong&gt; 4xx responses from uptime monitors usually indicate a misconfigured monitor, not a real outage. If you're getting 401 or 403 alerts from a production endpoint that was working, check whether authentication credentials rotated or IP allowlists changed.&lt;/p&gt;
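
&lt;p&gt;When a probe does hit a 429, the &lt;code&gt;Retry-After&lt;/code&gt; header says how long to back off. Its value is either a number of seconds or an HTTP date. The helper below is a sketch that handles both forms, and &lt;code&gt;retry_after_seconds&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value into seconds to wait.

    Accepts delay-seconds ("120") or an HTTP date
    ("Wed, 21 Oct 2026 07:28:00 GMT"). Never returns a negative delay.
    """
    value = value.strip()
    if value.isdigit():
        return float(value)
    when = parsedate_to_datetime(value)
    if now is None:
        now = datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

&lt;p&gt;A probe can sleep for the returned number of seconds before its next check instead of retrying immediately and extending the rate limit.&lt;/p&gt;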




&lt;h2&gt;
  
  
  5xx - Server Errors
&lt;/h2&gt;

&lt;p&gt;The server received a valid request and failed to fulfill it. These represent genuine server-side problems.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;500&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Internal Server Error&lt;/td&gt;
&lt;td&gt;A generic server-side failure. The server encountered an unexpected condition. Check server logs immediately.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;501&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not Implemented&lt;/td&gt;
&lt;td&gt;The server doesn't support the functionality required to fulfill the request. The request method isn't supported at all (unlike 405, which is per-resource).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;502&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bad Gateway&lt;/td&gt;
&lt;td&gt;The server is acting as a gateway or proxy and received an invalid response from the upstream server.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;503&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Service Unavailable&lt;/td&gt;
&lt;td&gt;The server is temporarily unable to handle the request - due to overload, maintenance, or a crashed upstream. Often includes a &lt;code&gt;Retry-After&lt;/code&gt; header.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;504&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gateway Timeout&lt;/td&gt;
&lt;td&gt;The server (acting as a gateway) timed out waiting for a response from an upstream server.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;505&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP Version Not Supported&lt;/td&gt;
&lt;td&gt;The server doesn't support the HTTP version used in the request.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;506&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Variant Also Negotiates&lt;/td&gt;
&lt;td&gt;Server configuration error in content negotiation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;507&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Insufficient Storage&lt;/td&gt;
&lt;td&gt;The server can't store the representation needed to complete the request. (WebDAV, also used by some APIs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;508&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Loop Detected&lt;/td&gt;
&lt;td&gt;The server detected an infinite loop while processing. (WebDAV)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;510&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not Extended&lt;/td&gt;
&lt;td&gt;The server requires further extensions to fulfill the request.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;511&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Network Authentication Required&lt;/td&gt;
&lt;td&gt;The client must authenticate to gain network access. Used by captive portals (hotel Wi-Fi, etc.).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  5xx codes you'll encounter most in monitoring
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;500 Internal Server Error&lt;/strong&gt; - The catch-all server failure. Your application threw an unhandled exception, crashed, or hit a bug. Check application logs immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;502 Bad Gateway&lt;/strong&gt; - Your web server (nginx/Apache) can't reach your application server (Node, Python, Ruby, etc.). The upstream process crashed, isn't running, or isn't accepting connections. Check if your app server process is running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;503 Service Unavailable&lt;/strong&gt; - The service is intentionally or unintentionally offline. During planned maintenance, return 503 with a &lt;code&gt;Retry-After&lt;/code&gt; header. During unplanned outages, 503 usually means your app is down or overwhelmed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;504 Gateway Timeout&lt;/strong&gt; - A slow database query, external API call, or background process is blocking your web server from responding within the timeout window. The upstream is alive but too slow.&lt;/p&gt;

&lt;h3&gt;
  
  
  502 vs. 503 vs. 504: the practical difference
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;th&gt;First thing to check&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;502&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Upstream is down or returning errors&lt;/td&gt;
&lt;td&gt;Is the app server process running?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;503&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Service is unavailable&lt;/td&gt;
&lt;td&gt;Is the service overloaded? Is maintenance active?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;504&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Upstream is alive but too slow&lt;/td&gt;
&lt;td&gt;Are there slow database queries? External API timeouts?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Monitoring tip:&lt;/strong&gt; Configure your monitoring tool to alert immediately on any 5xx from production endpoints. Even a single 500 from a health check endpoint that normally returns 200 is worth investigating, because a 5xx there almost always indicates a real problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference: Codes by Situation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  During deployment
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;502&lt;/td&gt;
&lt;td&gt;App server not yet started after deploy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;503&lt;/td&gt;
&lt;td&gt;Rolling deploy in progress with no healthy instances to serve traffic, or maintenance mode active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;Code bug introduced in the new release&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Auth-related
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;401&lt;/td&gt;
&lt;td&gt;Missing or expired credentials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;403&lt;/td&gt;
&lt;td&gt;Valid credentials, insufficient permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;407&lt;/td&gt;
&lt;td&gt;Proxy authentication required&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Rate limiting
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;429&lt;/td&gt;
&lt;td&gt;Client sent too many requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;503&lt;/td&gt;
&lt;td&gt;Server-side throttling (not per-client)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  API errors
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;td&gt;Malformed request body or invalid parameters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;409&lt;/td&gt;
&lt;td&gt;Concurrent edit conflict&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;422&lt;/td&gt;
&lt;td&gt;Valid syntax, invalid business logic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Redirects to know
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;301&lt;/td&gt;
&lt;td&gt;Permanent, GET after redirect (in practice)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;308&lt;/td&gt;
&lt;td&gt;Permanent, preserves method&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;302&lt;/td&gt;
&lt;td&gt;Temporary, GET after redirect (in practice)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;307&lt;/td&gt;
&lt;td&gt;Temporary, preserves method&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What to Monitor Against
&lt;/h2&gt;

&lt;p&gt;For uptime monitoring, the most useful configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Alert on&lt;/strong&gt;: Any 5xx response from production endpoints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert on&lt;/strong&gt;: 4xx responses that change from baseline (a 200 suddenly returning 404 or 403)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't alert on&lt;/strong&gt;: 301/302 if your monitor follows redirects and the final destination returns 200&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't alert on&lt;/strong&gt;: 304 if your monitor sends conditional requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate body content&lt;/strong&gt;: Don't rely on status code alone - a 200 with an error page in the body is a failure&lt;/li&gt;
&lt;/ul&gt;
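
&lt;p&gt;The rules above can be sketched as a single decision function. This is one possible encoding, not a standard: it assumes &lt;code&gt;status&lt;/code&gt; is the final status after redirects were followed, and alerting on an unexpected 304 is an added assumption. All names are hypothetical.&lt;/p&gt;

```python
def should_alert(status, baseline_status=200, body="", keyword=None,
                 sent_conditional=False):
    """Decide whether a probe result should raise an alert.

    status is the final status after redirects. baseline_status is what the
    endpoint normally returns. keyword, if given, must appear in a 200 body.
    """
    if status >= 500:
        return True  # any 5xx from production is alert-worthy
    if status == 304:
        # Healthy only when the probe actually sent a conditional request.
        return not sent_conditional
    if status >= 400:
        # Alert on 4xx only when it differs from the endpoint's baseline.
        return status != baseline_status
    if status == 200 and keyword is not None:
        # A 200 with an error page in the body is still a failure.
        return keyword not in body
    return False
```

&lt;p&gt;For example, &lt;code&gt;should_alert(401, baseline_status=401)&lt;/code&gt; stays quiet for an endpoint that is expected to require credentials, while a 200 that lacks the keyword still alerts.&lt;/p&gt;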

&lt;p&gt;The most dangerous monitoring gap isn't a missed 500. It's a service returning 200 with an upstream error page because the load balancer is still responding while the app is down.&lt;/p&gt;

</description>
      <category>api</category>
      <category>backend</category>
      <category>beginners</category>
      <category>webdev</category>
    </item>
    <item>
      <title>GitHub Outages in 2026: A Month-by-Month Analysis</title>
      <dc:creator>Vantaj</dc:creator>
      <pubDate>Thu, 02 Jul 2026 14:26:22 +0000</pubDate>
      <link>https://dev.to/vantaj_co/github-outages-in-2026-a-month-by-month-analysis-3h0g</link>
      <guid>https://dev.to/vantaj_co/github-outages-in-2026-a-month-by-month-analysis-3h0g</guid>
      <description>&lt;p&gt;GitHub is the world's largest code hosting platform, running services that 100 million developers depend on daily. When it goes down, CI/CD pipelines stall, deployments block, and teams lose access to code. Understanding when and why it fails - with real data, not vague status summaries - helps engineering teams build better contingency plans.&lt;/p&gt;

&lt;p&gt;This analysis covers every public GitHub incident from May 27 through June 26, 2026, sourced directly from &lt;a href="https://www.githubstatus.com" rel="noopener noreferrer"&gt;githubstatus.com&lt;/a&gt;. All durations, error rates, and root causes are taken from GitHub's own incident postmortems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Incident Summary: May 27 – June 26, 2026
&lt;/h2&gt;

&lt;p&gt;GitHub reported 25 incidents over this 30-day period. That averages to nearly one incident per calendar day - though most were narrow in scope (Copilot-specific or single-service), and several resolved in under 15 minutes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Incident&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;May 27&lt;/td&gt;
&lt;td&gt;Git operations, PRs, Issues, API&lt;/td&gt;
&lt;td&gt;69 min&lt;/td&gt;
&lt;td&gt;Analytics component CPU saturation (cascade)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 28&lt;/td&gt;
&lt;td&gt;Multiple services elevated errors&lt;/td&gt;
&lt;td&gt;9 min&lt;/td&gt;
&lt;td&gt;Partial auth service deployment, rolled back&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 1&lt;/td&gt;
&lt;td&gt;OpenAI models disruption&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;td&gt;Upstream AI provider&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 1&lt;/td&gt;
&lt;td&gt;Some GitHub services&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 4&lt;/td&gt;
&lt;td&gt;Webhook APIs and UI degraded&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 5&lt;/td&gt;
&lt;td&gt;Auth/API (0.11% wrong 404s) + Slack/Teams&lt;/td&gt;
&lt;td&gt;70 min&lt;/td&gt;
&lt;td&gt;Authorization component bug with user tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 6&lt;/td&gt;
&lt;td&gt;EU region: Codeload and Package Registry&lt;/td&gt;
&lt;td&gt;43 min&lt;/td&gt;
&lt;td&gt;Network circuit migration disrupted EU PoP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 8&lt;/td&gt;
&lt;td&gt;GitHub.com, REST API, GraphQL, Webhooks&lt;/td&gt;
&lt;td&gt;5-12 min&lt;/td&gt;
&lt;td&gt;Transient infrastructure capacity, self-resolved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 8&lt;/td&gt;
&lt;td&gt;Copilot Code Review failing&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 11&lt;/td&gt;
&lt;td&gt;Webhooks delayed&lt;/td&gt;
&lt;td&gt;~160 min&lt;/td&gt;
&lt;td&gt;Not detailed in postmortem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 12&lt;/td&gt;
&lt;td&gt;EU region disruption&lt;/td&gt;
&lt;td&gt;Linked to Jun 6&lt;/td&gt;
&lt;td&gt;Network migration (same root cause)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 12&lt;/td&gt;
&lt;td&gt;Code Scanning and Billing delays&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 15&lt;/td&gt;
&lt;td&gt;Feature flag service failure (analytics)&lt;/td&gt;
&lt;td&gt;44 min&lt;/td&gt;
&lt;td&gt;Feature flag client transient error, no retry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 16&lt;/td&gt;
&lt;td&gt;Pull Requests and Issues (signed-out)&lt;/td&gt;
&lt;td&gt;55 min&lt;/td&gt;
&lt;td&gt;Upstream model provider (Opus 4.8)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 17&lt;/td&gt;
&lt;td&gt;Copilot availability&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 18&lt;/td&gt;
&lt;td&gt;Auth/API (9% sporadic 401s, +800ms latency)&lt;/td&gt;
&lt;td&gt;80 min&lt;/td&gt;
&lt;td&gt;memcached misconfiguration during rollout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 18&lt;/td&gt;
&lt;td&gt;Feature flags service elevated errors&lt;/td&gt;
&lt;td&gt;Linked to Jun 15&lt;/td&gt;
&lt;td&gt;Same feature flag service issue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 19&lt;/td&gt;
&lt;td&gt;Webhooks incident&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 19&lt;/td&gt;
&lt;td&gt;Copilot next edit suggestions&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 23&lt;/td&gt;
&lt;td&gt;Copilot next edit suggestions elevated errors&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 24&lt;/td&gt;
&lt;td&gt;Some GitHub services&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 25&lt;/td&gt;
&lt;td&gt;Webhooks latency increased&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;td&gt;Not detailed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 25&lt;/td&gt;
&lt;td&gt;Webhooks, PRs, Actions, Issues degradation&lt;/td&gt;
&lt;td&gt;Resolved 18:27 UTC&lt;/td&gt;
&lt;td&gt;Not fully detailed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Five Most Significant Incidents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. May 27 - Git Operations Cascade (69 minutes)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; 3.5% of HTTPS pushes failed. 0.2% of SSH pushes failed. Pull Requests, Issues, GraphQL API degraded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; An internal analytics component generated unexpectedly high load, saturating CPU on the underlying infrastructure. Services that depended on Git operations began failing as a cascade.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resolution:&lt;/strong&gt; GitHub stopped the offending analytics component. Services recovered shortly after.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What went wrong:&lt;/strong&gt; An internal background system - not directly user-facing - created enough load to degrade core user-facing services. The analytics component lacked resource limits or circuit breakers that would have contained its impact.&lt;/p&gt;

&lt;p&gt;GitHub noted in the postmortem: &lt;em&gt;"We are taking steps to add resource limits and kill switches."&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  2. May 28 - Partial Deployment Triggers Multi-Service Errors (9 minutes)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; 10% of GitHub Actions runs failed to queue or encountered errors. Web experience, REST API, and Git operations all affected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; A change partially deployed to an authentication service caused dependent services to fail. The partial rollout state - neither the old version nor the new one fully applied - was the failure mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resolution:&lt;/strong&gt; GitHub rolled back the change. Recovery was fast because the rollback was straightforward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What went wrong:&lt;/strong&gt; The deployment validation process didn't catch that a partial deployment would produce an inconsistent state that downstream services couldn't handle.&lt;/p&gt;

&lt;p&gt;GitHub noted: &lt;em&gt;"We are expanding test coverage and improving our deployment validation process."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is a common pattern in large distributed systems: safe to deploy fully, unsafe to deploy partially.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. June 5 - Authorization Bug Deletes Slack/Teams Subscriptions (70 minutes)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; 0.11% of authenticated REST API requests returned incorrect "not found" responses. 12% of organizations with active Slack and Teams channel subscriptions had some subscriptions removed. 2% of all channel subscriptions deleted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; A change to an internal authorization component introduced a bug that failed to correctly resolve user-to-server token access for organization-owned repositories. The Slack and Teams integrations interpreted the transient "not found" responses as permanent loss of access and deleted the subscriptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resolution:&lt;/strong&gt; GitHub reverted the authorization component change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What went wrong:&lt;/strong&gt; The authorization bug itself was one failure. But the bigger failure mode was the integrations treating a transient error as permanent. When the API returned 404, the Slack and Teams integrations assumed access to the repository was gone and removed the subscription - irreversibly. Recovering deleted subscriptions required users to manually re-add them.&lt;/p&gt;

&lt;p&gt;This illustrates a dangerous API consumer pattern: treating any "not found" response as permanent and taking destructive action on it, rather than distinguishing between transient and durable errors.&lt;/p&gt;
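
&lt;p&gt;One way an API consumer might guard against this is to require repeated 404s over a minimum time window before doing anything destructive. The sketch below is illustrative only: &lt;code&gt;AccessTracker&lt;/code&gt;, its thresholds, and the "keep"/"remove" return values are assumptions made for this example, not how the Slack or Teams integrations actually work.&lt;/p&gt;

```python
import time
from typing import Optional


class AccessTracker:
    """Treat a resource as gone only after repeated 404s over a minimum window."""

    def __init__(self, min_failures: int = 5, min_window_s: float = 3600.0):
        self.min_failures = min_failures
        self.min_window_s = min_window_s
        self._first_seen = {}   # resource id -> time of first 404 in the streak
        self._count = {}        # resource id -> consecutive 404 count

    def record(self, resource: str, status: int, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        if status // 100 == 2:
            # a successful response proves access still exists: reset the streak
            self._first_seen.pop(resource, None)
            self._count.pop(resource, None)
            return "keep"
        if status != 404:
            # other errors say nothing about access, so leave the streak untouched
            return "keep"
        self._first_seen.setdefault(resource, now)
        self._count[resource] = self._count.get(resource, 0) + 1
        long_enough = now - self._first_seen[resource] >= self.min_window_s
        often_enough = self._count[resource] >= self.min_failures
        return "remove" if long_enough and often_enough else "keep"
```

&lt;p&gt;With a one-hour window, a 70-minute run of incorrect 404s would still need to persist across the full window and the minimum count before anything is removed. The right thresholds depend on how costly a wrongful deletion is compared with a stale subscription.&lt;/p&gt;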




&lt;h3&gt;
  
  
  4. June 18 - memcached Misconfiguration Causes 9% Auth Failures (80 minutes)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; ~9% of API requests returned sporadic 401 errors. ~800ms of additional latency on affected requests. Users experienced intermittent "logged out" behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; A memcached proxy service rollout to GitHub's internal API infrastructure caused the authentication service to pick up an incorrect memcached host configuration. When authentication lookups went to the wrong host, they failed - intermittently, not consistently, which made the issue harder to diagnose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resolution:&lt;/strong&gt; GitHub deployed a configuration change to memcached to use the correct host.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What went wrong:&lt;/strong&gt; Configuration changes to infrastructure components that authentication depends on require validation before rollout. A canary deployment or pre-rollout config verification step would have caught the incorrect host before production traffic hit it.&lt;/p&gt;

&lt;p&gt;GitHub noted plans: &lt;em&gt;"We plan to migrate our authentication system to prevent similar issues."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At 80 minutes, this was the longest incident with a detailed postmortem in this period. The June 11 webhook delay ran longer (~160 minutes), but its root cause was not detailed.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. June 6 - EU Network Migration Disrupts Package Registry (43 minutes)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; 0.95% average Codeload error rate. 9.2% average Package Registry error rate. Peak Package Registry errors reached 27%. Affected users whose traffic routed through European infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; A planned network circuit migration disrupted connectivity at one of GitHub's European Points of Presence. The traffic-shifting process "did not operate as expected," leaving some production traffic routed through the affected site.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resolution:&lt;/strong&gt; Traffic shifted away from the affected PoP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What went wrong:&lt;/strong&gt; Planned maintenance caused an unplanned outage. The traffic-shifting procedure had a failure mode that the team hadn't fully anticipated. Package Registry errors hit 27% at peak - significant for teams doing package installs in CI pipelines routed through EU infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Recurring Failure Patterns
&lt;/h2&gt;

&lt;p&gt;Across the 25 incidents in this period, four patterns account for most of the impact.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: Webhooks (5 incidents)
&lt;/h3&gt;

&lt;p&gt;Webhooks degraded or failed on June 4, June 11, June 19, and June 25 (twice). No single postmortem in this dataset explains what causes GitHub's webhook delivery to fail repeatedly. The frequency suggests either fragile infrastructure or a shared dependency that's hit by multiple different upstream issues.&lt;/p&gt;

&lt;p&gt;For teams that depend on webhooks for CI/CD triggers, deployment notifications, or workflow automations, GitHub webhook failures are a significant operational risk. Having a secondary delivery mechanism or monitoring for missed webhook events is worth the investment.&lt;/p&gt;
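
&lt;p&gt;A secondary mechanism can be as simple as reconciling what the webhook path delivered against what a periodic poll sees. The following Python sketch assumes each event has a stable identifier (a commit SHA or delivery ID, for instance). &lt;code&gt;EventReconciler&lt;/code&gt; and its methods are hypothetical names, and fetching the polled list from GitHub is left out.&lt;/p&gt;

```python
class EventReconciler:
    """Process each event once, whether it arrives by webhook or by polling."""

    def __init__(self):
        self._handled = set()   # ids of events already processed

    def on_webhook(self, event_id: str) -> bool:
        """Return True if the event is new and should be processed."""
        if event_id in self._handled:
            return False
        self._handled.add(event_id)
        return True

    def on_poll(self, event_ids: list) -> list:
        """Return events the webhook path never delivered, marking them handled."""
        # dict.fromkeys removes duplicates while keeping the polled order
        missed = [e for e in dict.fromkeys(event_ids) if e not in self._handled]
        self._handled.update(missed)
        return missed


reconciler = EventReconciler()
reconciler.on_webhook("a")                      # delivered normally
print(reconciler.on_poll(["a", "b", "c"]))      # the poll finds the two missed events
```

&lt;p&gt;In a real pipeline the handled set would need to be persisted and pruned. This version keeps it in memory to show the idea.&lt;/p&gt;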

&lt;h3&gt;
  
  
  Pattern 2: Copilot AI Services (6 incidents)
&lt;/h3&gt;

&lt;p&gt;Copilot-specific incidents appeared on June 1, June 8, June 17, June 19, and June 23, and the June 16 disruption traced back to an upstream model provider. GitHub Copilot depends on external AI model providers (OpenAI, Anthropic), which introduces a dependency layer outside GitHub's direct control.&lt;/p&gt;

&lt;p&gt;These incidents are largely independent of core GitHub services. If Copilot completions fail, PRs and Issues continue working normally. But for teams where Copilot is integrated into developer workflows, the frequency of AI model disruptions is notable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: Deployment-Triggered Failures
&lt;/h3&gt;

&lt;p&gt;Two of the five detailed incidents trace directly to a deployment or rollout: the May 28 partial authentication deployment and the June 18 memcached rollout.&lt;/p&gt;

&lt;p&gt;Both could have been caught earlier with stricter pre-deployment validation. Both resolved quickly once identified. Both caused disproportionate impact relative to the change being made - the May 28 incident affected 10% of Actions runs from a single partially deployed change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 4: Auth and API Instability
&lt;/h3&gt;

&lt;p&gt;The June 5 authorization bug and June 18 memcached issue both affected authentication. Auth is a foundational dependency - when it degrades intermittently, every service that requires authentication sees errors. The 80-minute duration of June 18 and the subscription deletion side effect of June 5 make these the highest-impact incident types in this dataset.&lt;/p&gt;




&lt;h2&gt;
  
  
  Incident Frequency by Affected Service
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Incidents (May 27 – Jun 26)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Webhooks&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copilot / AI features&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API / Auth&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Core GitHub services (PRs, Issues, Git)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EU / Regional&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Other (Code Scanning, Billing)&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Uptime Estimates
&lt;/h2&gt;

&lt;p&gt;GitHub doesn't publish an overall uptime percentage on their status page. Based on the detailed postmortem durations available:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Incident&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;May 27 Git cascade&lt;/td&gt;
&lt;td&gt;69 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 28 Auth deployment&lt;/td&gt;
&lt;td&gt;9 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 5 Auth/API/Slack&lt;/td&gt;
&lt;td&gt;70 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 6 EU network&lt;/td&gt;
&lt;td&gt;43 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 8 GitHub.com/API&lt;/td&gt;
&lt;td&gt;5-12 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 11 Webhooks&lt;/td&gt;
&lt;td&gt;~160 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 15 Feature flags&lt;/td&gt;
&lt;td&gt;44 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jun 18 Auth/API memcached&lt;/td&gt;
&lt;td&gt;80 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total (documented)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~485 min over 30 days&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Roughly 485 minutes of documented degradation over 30 days (43,200 minutes) represents about 98.9% availability for the services specifically affected during those windows - not accounting for the many incidents without detailed duration data.&lt;/p&gt;

&lt;p&gt;This aligns with GitHub's informal track record of 99.x% availability, with occasional multi-hour events and frequent short-lived degradations.&lt;/p&gt;
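
&lt;p&gt;The estimate can be rechecked by summing the durations in the table. The short Python sketch below does that, carrying the June 8 range as a low and high bound and treating the approximate June 11 figure as exact. The variable and function names are just for illustration.&lt;/p&gt;

```python
# Durations in minutes, copied from the table; (low, high) to carry the Jun 8 range.
durations = {
    "May 27 Git cascade": (69, 69),
    "May 28 Auth deployment": (9, 9),
    "Jun 5 Auth/API/Slack": (70, 70),
    "Jun 6 EU network": (43, 43),
    "Jun 8 GitHub.com/API": (5, 12),
    "Jun 11 Webhooks": (160, 160),      # reported as approximate
    "Jun 15 Feature flags": (44, 44),
    "Jun 18 Auth/API memcached": (80, 80),
}
WINDOW_MIN = 30 * 24 * 60               # 43,200 minutes in 30 days

low = sum(lo for lo, _ in durations.values())
high = sum(hi for _, hi in durations.values())


def availability(downtime_min: float, window_min: int = WINDOW_MIN) -> float:
    """Percentage of the window not covered by documented degradation."""
    return 100.0 * (1 - downtime_min / window_min)


print(f"documented degradation: {low}-{high} min")
print(f"availability: {availability(high):.2f}%-{availability(low):.2f}%")
```

&lt;p&gt;With the eight durations listed, the sum comes to 480-487 minutes. Adding up minutes across different services overstates the downtime any single service saw, so this is a ceiling on documented impact rather than a per-service figure.&lt;/p&gt;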




&lt;h2&gt;
  
  
  What This Means for Teams That Depend on GitHub
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Don't build pipelines with a single webhook trigger.&lt;/strong&gt; Webhooks are GitHub's most unreliable service based on this dataset - five incidents in one month. If a missed webhook blocks a deployment or notification, build a polling fallback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model AI feature dependency separately.&lt;/strong&gt; Copilot, Code Review AI, and AI-powered features depend on upstream model providers that GitHub doesn't control. Design workflows that degrade gracefully when Copilot is unavailable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor your integration points.&lt;/strong&gt; The June 5 incident deleted Slack/Teams subscriptions silently. If your GitHub Slack integration had stopped posting notifications, your team might not have noticed for hours. Monitor the output of your GitHub integrations, not just GitHub's status page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch for EU-specific issues.&lt;/strong&gt; Two incidents in this period specifically affected European infrastructure. If your team routes CI/CD through EU GitHub infrastructure, regional monitoring that checks from inside Europe gives earlier signal than a US-based check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch the GitHub Status API.&lt;/strong&gt; GitHub publishes machine-readable status at &lt;a href="https://api.githubstatus.com/v2/summary.json" rel="noopener noreferrer"&gt;api.githubstatus.com/v2/summary.json&lt;/a&gt;. Monitor that endpoint programmatically or subscribe to status page notifications so you get the first alert, not the second-hand report from a developer who noticed their PR wasn't building.&lt;/p&gt;
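
&lt;p&gt;Once you have the summary payload, filtering it down to the components your team cares about takes only a few lines. The sketch below assumes the payload follows the Statuspage-style summary shape (a &lt;code&gt;components&lt;/code&gt; list with &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;status&lt;/code&gt; fields). The sample JSON is made up for illustration, the function names are hypothetical, and fetching the endpoint is left out.&lt;/p&gt;

```python
import json


def degraded_components(summary: dict) -> list:
    """Return (name, status) pairs for components that are not operational."""
    return [
        (c.get("name", ""), c.get("status", ""))
        for c in summary.get("components", [])
        if c.get("status") != "operational"
    ]


def should_page(summary: dict, watched: set) -> bool:
    """Alert only when a component this team depends on is degraded."""
    return any(name in watched for name, _ in degraded_components(summary))


# Made-up sample payload in the assumed shape.
sample = json.loads("""
{
  "status": {"indicator": "minor", "description": "Partially Degraded Service"},
  "components": [
    {"name": "Git Operations", "status": "operational"},
    {"name": "Webhooks", "status": "degraded_performance"}
  ]
}
""")

print(degraded_components(sample))
print(should_page(sample, {"Webhooks", "Actions"}))
```

&lt;p&gt;Filtering by a watched set keeps a Copilot-only incident from paging a team that only depends on Git operations and webhooks.&lt;/p&gt;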




&lt;p&gt;&lt;em&gt;All incident data sourced from &lt;a href="https://www.githubstatus.com" rel="noopener noreferrer"&gt;githubstatus.com&lt;/a&gt; and GitHub's published postmortems. Durations and error rates are taken verbatim from GitHub's own incident reports. This analysis covers the 30-day window available in the public incident feed at time of writing (June 26, 2026).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>analysis</category>
      <category>devops</category>
      <category>github</category>
      <category>sre</category>
    </item>
    <item>
      <title>Alert Fatigue Is Your Tool's Fault, Not Your Infrastructure's</title>
      <dc:creator>Vantaj</dc:creator>
      <pubDate>Thu, 02 Jul 2026 14:24:55 +0000</pubDate>
      <link>https://dev.to/vantaj_co/alert-fatigue-is-your-tools-fault-not-your-infrastructures-c9i</link>
      <guid>https://dev.to/vantaj_co/alert-fatigue-is-your-tools-fault-not-your-infrastructures-c9i</guid>
      <description>&lt;p&gt;Teams blame noisy infrastructure for alert fatigue. The real culprit is monitoring tools that fire on every blip. Here's why the problem is architectural - and what to do about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Reason Your Team Ignores Alerts
&lt;/h2&gt;

&lt;p&gt;There's a pattern we see over and over. A team sets up monitoring. The first week, everyone responds to every alert within minutes. By week three, the median response time doubles. By month two, someone creates a Slack channel called &lt;code&gt;#alerts-graveyard&lt;/code&gt; and routes everything there.&lt;/p&gt;

&lt;p&gt;The team blames their infrastructure. "Our services are just flaky." "Kubernetes pods restart sometimes, it's normal." "The network hiccups at 2 AM, nothing we can do."&lt;/p&gt;

&lt;p&gt;But the infrastructure isn't the problem. The monitoring tool is.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Monitoring Tools Train You to Ignore Alerts
&lt;/h2&gt;

&lt;p&gt;Alert fatigue doesn't happen overnight. It's a gradual erosion of trust, and it follows a predictable cycle:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Vigilance.&lt;/strong&gt; Tool is new. Every alert gets investigated. Team feels in control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Doubt.&lt;/strong&gt; After the fifth false positive in a week, someone says "probably nothing" before checking. Investigations get shorter. Some alerts get acknowledged without looking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: Filtering.&lt;/strong&gt; The team creates rules to suppress the noisiest monitors. They mute Slack notifications for non-critical services. They stop checking the monitoring dashboard unless something else confirms an issue - a customer complaint, a spike in error rates, a colleague mentioning it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 4: Abandonment.&lt;/strong&gt; Alerts are effectively ignored. The monitoring tool is running, the dashboard is green, but nobody trusts it. When a real outage happens, the team finds out from customers. The monitoring tool sent an alert 12 minutes ago. Nobody saw it.&lt;/p&gt;

&lt;p&gt;This isn't a discipline problem. This is a design problem. The tool trained the team to stop paying attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture of Bad Alerts
&lt;/h2&gt;

&lt;p&gt;Most monitoring tools are built on architecture that makes false positives inevitable. Here's what's happening under the hood.&lt;/p&gt;

&lt;h3&gt;
  
  
  One Probe, One Vote
&lt;/h3&gt;

&lt;p&gt;The simplest monitoring architecture is a single server that sends requests to your endpoints on a schedule. If the request fails, an alert fires.&lt;/p&gt;

&lt;p&gt;The problem: networks are messy. Between your monitoring probe and your server, there are dozens of hops - routers, switches, ISPs, CDN edges, load balancers. Any one of them can hiccup. A packet gets dropped. A DNS response is delayed. A TLS handshake times out because of a transient issue at a certificate authority.&lt;/p&gt;

&lt;p&gt;None of these are your problem. Your users aren't affected. But your monitoring tool doesn't know that, because it only has one vantage point.&lt;/p&gt;

&lt;p&gt;This is like diagnosing a city's traffic based on one intersection. If that intersection has a fender bender, you'd conclude the entire city is gridlocked.&lt;/p&gt;

&lt;h3&gt;
  
  
  Threshold Roulette
&lt;/h3&gt;

&lt;p&gt;Most tools let you configure timeout thresholds - how long to wait before declaring a check "failed." The default is usually 3–5 seconds, and most teams leave it there.&lt;/p&gt;

&lt;p&gt;But here's the thing: response time isn't constant. Your API might respond in 200ms at 10 AM and 3.2 seconds at 2 PM during a traffic spike. Both are normal. A 3-second timeout treats the afternoon spike as a failure.&lt;/p&gt;

&lt;p&gt;Now your monitoring tool is alerting on load patterns that have been happening since launch. It's not detecting a problem - it's detecting Tuesday.&lt;/p&gt;

&lt;h3&gt;
  
  
  No Memory, No Context
&lt;/h3&gt;

&lt;p&gt;Most monitoring tools treat every check as independent. They don't know that the same endpoint "failed" for 0.3 seconds last Tuesday and recovered immediately. They don't know that the last 4,000 checks were successful. They don't know that the failure correlates with a known AWS maintenance window.&lt;/p&gt;

&lt;p&gt;Each check exists in a vacuum. Pass or fail. Alert or don't. There's no concept of "this looks like a blip" versus "this looks like a real outage."&lt;/p&gt;

&lt;h3&gt;
  
  
  Alert-Per-Check Design
&lt;/h3&gt;

&lt;p&gt;The most egregious architectural flaw: many tools generate one alert per failed check, not one alert per incident. If your service flaps - up, down, up, down - you get four notifications in ten minutes. Each one buzzes your phone, sends an email, and posts to Slack.&lt;/p&gt;

&lt;p&gt;After the third buzz in five minutes, you stop looking.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Math of Alert Fatigue
&lt;/h2&gt;

&lt;p&gt;Let's put some numbers on this.&lt;/p&gt;

&lt;p&gt;Say you have 30 monitors, each checking every 5 minutes. That's 8,640 checks per day across all monitors.&lt;/p&gt;

&lt;p&gt;If your false positive rate is 0.5% - which sounds tiny - that's &lt;strong&gt;43 false alerts per day&lt;/strong&gt;. Almost two per hour. One every 33 minutes.&lt;/p&gt;

&lt;p&gt;If your team works in 8-hour shifts, each person sees roughly 14 false alerts per shift. Over seven shifts, that's about 100 false alerts per person that required investigation and turned out to be nothing.&lt;/p&gt;
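
&lt;p&gt;To plug in your own numbers, the same arithmetic fits in a few lines of Python. The function name and the three-shifts-per-day assumption are ours, and the inputs below reproduce the example above.&lt;/p&gt;

```python
def false_alert_load(monitors: int, interval_min: int, fp_rate: float) -> dict:
    """Estimate daily false alerts from monitor count, check interval, and FP rate."""
    checks_per_day = monitors * (24 * 60 // interval_min)
    per_day = checks_per_day * fp_rate
    return {
        "checks_per_day": checks_per_day,
        "false_alerts_per_day": per_day,
        "minutes_between_alerts": 24 * 60 / per_day,
        "per_8h_shift": per_day / 3,        # assumes three 8-hour shifts per day
    }


load = false_alert_load(monitors=30, interval_min=5, fp_rate=0.005)
for name, value in load.items():
    print(f"{name}: {value:.1f}")
```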

&lt;p&gt;Now consider the psychological cost. Research on alarm fatigue in healthcare - where the stakes are literally life and death - reports that 85-99% of clinical alarms require no intervention, and that clinicians become desensitized and start ignoring them as a result. In engineering, tolerance for noise is lower because the perceived consequence is lower. Teams start tuning out after just a few false positives per week.&lt;/p&gt;

&lt;p&gt;At 0.5% false positive rate, you've already lost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "Just Tune Your Thresholds" Doesn't Work
&lt;/h2&gt;

&lt;p&gt;The standard advice for alert fatigue is: tune your thresholds, add escalation policies, create runbooks. This is treating symptoms, not the disease.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tuning thresholds&lt;/strong&gt; is a never-ending game. You loosen the timeout to 10 seconds, and the false positives stop - until your next traffic spike pushes response times to 11 seconds. You tighten it back, and the 2 AM network blips start triggering again. Every threshold change is a trade-off between sensitivity and noise, and the optimal setting drifts with your traffic patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Escalation policies&lt;/strong&gt; just redistribute the fatigue. Instead of the whole team being fatigued, now your on-call rotation is fatigued. You've concentrated the misery instead of eliminating it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runbooks&lt;/strong&gt; help with real incidents. They do nothing for false positives, because the runbook says "investigate" and the investigation concludes "nothing is wrong." You've just formalized the time waste.&lt;/p&gt;

&lt;p&gt;The problem isn't configuration. The problem is that the tool's architecture guarantees noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Fixes This
&lt;/h2&gt;

&lt;p&gt;Alert fatigue is an architectural problem, and it requires an architectural solution. There are three changes that matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Multi-Region Consensus
&lt;/h3&gt;

&lt;p&gt;Instead of one probe deciding if your service is down, check from multiple independent locations and require agreement before alerting.&lt;/p&gt;

&lt;p&gt;If a check fails from Frankfurt but passes from Virginia and Singapore, it's a network issue - not an outage. If it fails from all three, something is genuinely wrong.&lt;/p&gt;

&lt;p&gt;This single change eliminates the majority of false positives. The math is simple: the probability of three independent network paths all experiencing transient failures simultaneously is negligibly small. If all three see a failure, it's real.&lt;/p&gt;

&lt;p&gt;This should be the default behavior. Not a premium feature. Not an opt-in configuration. The default.&lt;/p&gt;
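
&lt;p&gt;Here is a minimal sketch of the consensus rule and the probability argument behind it. The function names are hypothetical, and the calculation assumes region failures are fully independent. That is an idealization, since probes that share a DNS provider or upstream network can fail together.&lt;/p&gt;

```python
from typing import Optional


def is_down(results: dict, quorum: Optional[int] = None) -> bool:
    """results maps region name -> True if that region's check passed.

    By default every region must fail before the service counts as down.
    """
    failures = sum(1 for passed in results.values() if not passed)
    needed = len(results) if quorum is None else quorum
    return bool(results) and failures >= needed


def false_alarm_probability(p_single: float, regions: int) -> float:
    """Chance that every region fails at once, assuming independent failures."""
    return p_single ** regions


# One region failing is treated as a network issue, not an outage.
print(is_down({"frankfurt": False, "virginia": True, "singapore": True}))

# With a 0.5% per-probe false positive rate and three regions:
p = false_alarm_probability(0.005, 3)
print(f"per-check false alarm probability: {p:.3g}")
print(f"expected false alerts per day at 8,640 checks: {p * 8640:.4f}")
```

&lt;p&gt;Under the independence assumption, the 0.5% single-probe rate drops to 0.005 cubed, or about one in eight million checks. Passing a lower &lt;code&gt;quorum&lt;/code&gt; (two of three, say) trades some of that reduction for tolerance of a single broken probe.&lt;/p&gt;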

&lt;h3&gt;
  
  
  2. Confirmation Before Alerting
&lt;/h3&gt;

&lt;p&gt;When a check fails (even from multiple regions), wait one check interval and verify. If the next check passes, it was a transient blip - don't alert.&lt;/p&gt;

&lt;p&gt;This adds a small delay to detection (30 seconds to 1 minute, depending on your check interval), but it filters out the short-lived failures that resolve themselves before any human could respond anyway. You weren't going to fix a 30-second blip. You probably weren't even going to finish reading the alert before it recovered.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Incident-Based Alerting, Not Check-Based
&lt;/h3&gt;

&lt;p&gt;One incident, one notification. If your service goes down and stays down, you get one alert - not a new notification every time a check runs. When it recovers, you get one recovery message.&lt;/p&gt;

&lt;p&gt;This sounds obvious, but most tools still default to per-check alerting. Five failed checks in a row means five Slack messages, five emails, five phone buzzes. Each one interrupts focus. None of them add information.&lt;/p&gt;
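
&lt;p&gt;Confirmation and incident-based alerting combine naturally into one small state machine. The Python sketch below is one possible shape, not any specific tool's implementation: &lt;code&gt;IncidentTracker&lt;/code&gt; and the &lt;code&gt;confirmations&lt;/code&gt; parameter are names invented for this example.&lt;/p&gt;

```python
class IncidentTracker:
    """Emit one 'down' after N consecutive failures and one 'recovered' afterwards."""

    def __init__(self, confirmations: int = 2):
        self.confirmations = confirmations   # consecutive failures needed to alert
        self._failures = 0
        self._open = False                   # True while an incident is ongoing

    def observe(self, passed: bool):
        """Feed one check result; return 'down', 'recovered', or None."""
        if passed:
            self._failures = 0
            if self._open:
                self._open = False
                return "recovered"
            return None
        self._failures += 1
        if not self._open and self._failures >= self.confirmations:
            self._open = True
            return "down"
        return None                          # unconfirmed blip, or incident already open


tracker = IncidentTracker(confirmations=2)
checks = [True, False, True, False, False, False, True]
print([tracker.observe(c) for c in checks])
```

&lt;p&gt;The single failed check early in the sequence produces no notification, and the three-check outage produces exactly one "down" and one "recovered".&lt;/p&gt;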

&lt;h2&gt;
  
  
  The Cost of Getting This Wrong
&lt;/h2&gt;

&lt;p&gt;Alert fatigue isn't just annoying. It's dangerous. Here's what happens when a team stops trusting their monitoring:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slower incident response.&lt;/strong&gt; When a real outage happens, the alert sits in a channel that nobody watches. Mean time to detection goes from minutes to hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shadow monitoring.&lt;/strong&gt; Engineers start building their own monitoring - a cron job that curls the endpoint, a Grafana dashboard they check manually, a personal script that sends them a text. Now you have fragmented, inconsistent monitoring with no shared visibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customer-reported outages.&lt;/strong&gt; The worst way to find out about downtime is from a customer. It means your monitoring failed at its primary job. It damages trust with the customer and confidence within the team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring abandonment.&lt;/strong&gt; Eventually, someone suggests removing the monitoring tool entirely. "We're paying $200/month for something nobody looks at." They're right - but the answer isn't less monitoring. It's better monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Audit Your Current Setup
&lt;/h2&gt;

&lt;p&gt;Before you change tools, measure where you stand:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Export your alert history for the last 30 days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Categorize each alert:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Actionable&lt;/strong&gt; - required investigation, and the investigation revealed a real problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False positive&lt;/strong&gt; - investigation revealed no real issue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redundant&lt;/strong&gt; - a duplicate alert for an already-known incident&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Calculate your signal-to-noise ratio: &lt;code&gt;actionable alerts / total alerts&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If your ratio is below 80%, a meaningful share of your team's attention is going to noise. Below 50%, more alerts are noise than signal, and your monitoring is actively making things worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; For each false positive, identify the root cause:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-region network issue?&lt;/li&gt;
&lt;li&gt;Threshold too tight?&lt;/li&gt;
&lt;li&gt;Transient blip with no confirmation?&lt;/li&gt;
&lt;li&gt;Flapping service with per-check alerting?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tells you whether the problem is fixable with configuration changes or if the tool's architecture is fundamentally limited.&lt;/p&gt;
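
&lt;p&gt;If your alert export can be labeled by hand, the audit itself is a few lines of Python. This sketch assumes each alert has been tagged with one of the three categories from Step 2 and, for false positives, a root cause from Step 4. The label strings and function names are placeholders.&lt;/p&gt;

```python
from collections import Counter


def signal_to_noise(labels: list) -> float:
    """Share of alerts labeled 'actionable'; labels has one entry per alert."""
    if not labels:
        return 0.0
    return Counter(labels)["actionable"] / len(labels)


def root_cause_breakdown(causes: list) -> list:
    """Root causes of false positives, most common first."""
    return Counter(causes).most_common()


# Hypothetical 30-day export: 10 alerts, hand-labeled.
labels = ["actionable"] * 3 + ["false_positive"] * 5 + ["redundant"] * 2
causes = ["single_region", "single_region", "threshold", "no_confirmation", "single_region"]

print(f"signal-to-noise: {signal_to_noise(labels):.0%}")
print(root_cause_breakdown(causes))
```

&lt;p&gt;If one root cause dominates the breakdown, that points at which of the architectural changes above would remove the most noise.&lt;/p&gt;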

&lt;h2&gt;
  
  
  The Standard That Should Exist
&lt;/h2&gt;

&lt;p&gt;Here's a simple test for any monitoring tool: &lt;strong&gt;if an alert fires, is it worth waking someone up at 3 AM?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "is there a configuration that could make it worth waking someone up." Is the default behavior - out of the box, with minimal configuration - reliable enough that every alert deserves attention?&lt;/p&gt;

&lt;p&gt;If the answer is no, the tool is training your team to ignore alerts. And a team that ignores alerts is worse than a team with no monitoring at all, because at least the team with no monitoring knows they're flying blind.&lt;/p&gt;

&lt;p&gt;The team with bad monitoring thinks they're covered.&lt;/p&gt;

&lt;p&gt;They're not.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>uptime</category>
    </item>
    <item>
      <title>Single-Region Monitoring Is Broken by Design</title>
      <dc:creator>Vantaj</dc:creator>
      <pubDate>Tue, 23 Jun 2026 17:39:42 +0000</pubDate>
      <link>https://dev.to/vantaj_co/single-region-monitoring-is-broken-by-design-3ml2</link>
      <guid>https://dev.to/vantaj_co/single-region-monitoring-is-broken-by-design-3ml2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fsynzha8gl8khxohcgoma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fsynzha8gl8khxohcgoma.png" alt=" " width="800" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Single-Region Monitoring Fails for a Simple Reason
&lt;/h2&gt;

&lt;p&gt;If your uptime monitor checks from one location, one network path failure can look exactly like a production outage.&lt;/p&gt;

&lt;p&gt;That means a routing issue in Frankfurt, a transient DNS timeout in Singapore, or a brief transit provider hiccup between a probe and your server can all trigger the same alert: &lt;strong&gt;your site is down&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Sometimes it is.&lt;/p&gt;

&lt;p&gt;Often, it isn't.&lt;/p&gt;

&lt;p&gt;That is the core problem with single-region monitoring. It confuses &lt;strong&gt;"one path failed"&lt;/strong&gt; with &lt;strong&gt;"the service is unavailable."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3 AM Alert That Wasn't Real
&lt;/h2&gt;

&lt;p&gt;Your phone buzzes at 3:17 AM.&lt;/p&gt;

&lt;p&gt;The alert says your production API is down. You open your laptop, check the dashboard, hit the health endpoint manually, look at logs, maybe restart a shell session just to be sure.&lt;/p&gt;

&lt;p&gt;Everything is fine.&lt;/p&gt;

&lt;p&gt;The failed check came from one probe in one city. Your infrastructure is healthy. Your users are unaffected. Somewhere between that probe and your server, a packet got dropped, a route flapped, or a resolver had a bad minute.&lt;/p&gt;

&lt;p&gt;But the monitoring tool does not know that. It saw one failed request and escalated the worst possible interpretation.&lt;/p&gt;

&lt;p&gt;This is how teams end up with alert fatigue. Not because their infrastructure is uniquely flaky, but because their monitoring model is too naive for how the internet actually behaves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why So Many Tools Still Work This Way
&lt;/h2&gt;

&lt;p&gt;Single-region checking is popular because it is operationally simple.&lt;/p&gt;

&lt;p&gt;One monitor gets assigned to one probe on one schedule. That is easy to scale, easy to explain, and cheap to run. For the vendor, it is efficient.&lt;/p&gt;

&lt;p&gt;For the customer, it creates a blind spot.&lt;/p&gt;

&lt;p&gt;The design assumes that if one probe cannot reach your service, the service must be down. That assumption only works if the network path between the probe and your infrastructure is perfectly reliable.&lt;/p&gt;

&lt;p&gt;It isn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Internet Is a Chain of Failure Points
&lt;/h2&gt;

&lt;p&gt;A check from Frankfurt to Virginia is not a direct line. It passes through multiple systems operated by multiple companies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the monitoring provider's own network&lt;/li&gt;
&lt;li&gt;one or more transit providers&lt;/li&gt;
&lt;li&gt;internet exchange points&lt;/li&gt;
&lt;li&gt;long-haul terrestrial or submarine links&lt;/li&gt;
&lt;li&gt;your cloud or hosting provider&lt;/li&gt;
&lt;li&gt;your application itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only the last two are actually your problem.&lt;/p&gt;

&lt;p&gt;Everything before that can fail independently. And when any one of those upstream links fails, the monitoring probe sees the same thing it would see if your app were truly down: timeout, connection error, no response.&lt;/p&gt;

&lt;p&gt;A single-region monitor cannot tell the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;your application is unavailable&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;the route from that probe to your application is degraded&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why false alerts are not a tuning issue. They are an architecture issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  The False-Positive Math
&lt;/h2&gt;

&lt;p&gt;Here is the rough intuition.&lt;/p&gt;

&lt;p&gt;If a monitor checks once per minute from a single location, that is &lt;strong&gt;1,440 checks per day&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If the end-to-end path between that probe and your service is reliable &lt;strong&gt;99.95%&lt;/strong&gt; of the time, then the failure rate for that path is &lt;strong&gt;0.05% per check&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That gives you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1,440 × 0.0005 = 0.72 path-level failures per day
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is roughly &lt;strong&gt;5 failed checks per week&lt;/strong&gt; caused by network path issues alone.&lt;/p&gt;
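
&lt;p&gt;If you want to plug in your own numbers, the same arithmetic fits in a few lines. This is only a sketch: the constant names are illustrative, and the 99.95% figure is the simplified assumption from above, not a measured value.&lt;/p&gt;

```python
CHECKS_PER_DAY = 24 * 60      # one check per minute
PATH_RELIABILITY = 0.9995     # 99.95%, the simplified assumption used above


def expected_path_failures(checks, reliability):
    """Expected number of checks that fail because of the path alone."""
    return checks * (1 - reliability)


per_day = expected_path_failures(CHECKS_PER_DAY, PATH_RELIABILITY)
per_week = per_day * 7
print(f"{per_day:.2f} per day, {per_week:.2f} per week")
```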

&lt;p&gt;And that is before you add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transient DNS failures&lt;/li&gt;
&lt;li&gt;TLS handshake hiccups&lt;/li&gt;
&lt;li&gt;overloaded probe nodes&lt;/li&gt;
&lt;li&gt;regional packet loss&lt;/li&gt;
&lt;li&gt;brief resolver or CDN anomalies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, it is easy to end up with &lt;strong&gt;7–10 false alerts per week&lt;/strong&gt; from a single critical monitor if the tool alerts on first failure from one region.&lt;/p&gt;

&lt;p&gt;Now multiply that across 20 monitors.&lt;/p&gt;

&lt;p&gt;Even if only a fraction of those failed checks page a human, you still burn real engineering time investigating things that were never incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  More Regions Only Help If They Agree
&lt;/h2&gt;

&lt;p&gt;This is where a lot of monitoring tools muddy the story.&lt;/p&gt;

&lt;p&gt;They advertise &lt;strong&gt;multi-region checks&lt;/strong&gt;. That sounds like the fix, but it only helps if the alerting logic uses those regions as a voting system.&lt;/p&gt;

&lt;p&gt;There is a big difference between:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;checking from multiple regions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;requiring multiple regions to confirm failure before alerting&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Many tools do the first but not the second.&lt;/p&gt;

&lt;p&gt;They run checks from multiple locations, but if any one region fails, they still alert. That gives you more data, but it does not solve the noise problem. In some cases it makes it worse, because you now have more independent paths that can fail.&lt;/p&gt;

&lt;p&gt;What actually works is &lt;strong&gt;consensus&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If Frankfurt says "down" but Virginia and Singapore say "up," the correct conclusion is not "incident." It is "this looks regional or path-specific, keep watching."&lt;/p&gt;
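
&lt;p&gt;Here is a minimal sketch of that voting logic. The function name, the status strings, and the &lt;code&gt;quorum&lt;/code&gt; parameter are assumptions made for illustration, not any particular tool's API.&lt;/p&gt;

```python
def classify(results, quorum=2):
    """Classify one round of checks.

    results maps a region name to True (check passed) or False (check failed).
    quorum is how many regions must fail before we call it an incident.
    """
    failed = [region for region, ok in results.items() if not ok]
    if not failed:
        return "healthy"
    if len(failed) >= quorum:
        return "incident"
    return "regional-or-path-issue"


print(classify({"frankfurt": False, "virginia": True, "singapore": True}))
print(classify({"frankfurt": False, "virginia": False, "singapore": True}))
```

&lt;p&gt;The first call reports a regional or path issue and the second reports an incident. A stricter setup could raise &lt;code&gt;quorum&lt;/code&gt; to the total number of regions so that every probe has to agree.&lt;/p&gt;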

&lt;h2&gt;
  
  
  Why Consensus Changes the Math
&lt;/h2&gt;

&lt;p&gt;With consensus, a false alert requires all of the confirming regions to fail at the same time.&lt;/p&gt;

&lt;p&gt;Using the same simplified reliability assumption (a 0.05% failure rate per path) across three regions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.0005 × 0.0005 × 0.0005 = 0.000000000125
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is &lt;strong&gt;0.0000000125%&lt;/strong&gt; per check, or roughly one in eight billion.&lt;/p&gt;

&lt;p&gt;The exact real-world number depends on how independent the network paths truly are, so you should treat this as directional rather than absolute. But the principle holds: the probability of three independent paths failing together is dramatically lower than the probability of one path failing alone.&lt;/p&gt;
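
&lt;p&gt;To make the contrast concrete, here is a short sketch comparing "alert if any region fails" with "alert only if all regions fail." It assumes three fully independent paths with the same 0.05% failure rate, which real networks will not match exactly.&lt;/p&gt;

```python
P_FAIL = 0.0005   # per-check path failure rate (the 99.95% assumption)
REGIONS = 3

# Alert if ANY region fails: at least one of the independent paths fails.
p_any = 1 - (1 - P_FAIL) ** REGIONS

# Alert only if ALL regions fail: every independent path fails in the same check.
p_all = P_FAIL ** REGIONS

print(f"any-region alerting:  {p_any:.6f}")
print(f"all-region consensus: {p_all:.3e}")
```

&lt;p&gt;Under these assumptions, any-region alerting fires falsely on about 0.15% of checks, roughly three times the single-region rate, while full consensus drops to about 1.25e-10.&lt;/p&gt;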

&lt;p&gt;That is the entire point of consensus-based monitoring. It turns "a random path issue" into background noise instead of an incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Single-Region Monitoring Cannot Tell You
&lt;/h2&gt;

&lt;p&gt;False positives are only half the problem.&lt;/p&gt;

&lt;p&gt;Single-region monitoring also hides things you actually care about.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Regional Outages
&lt;/h3&gt;

&lt;p&gt;If your only probe is in the US and your users in Europe are seeing failures, your dashboard may stay green while your support queue fills up.&lt;/p&gt;

&lt;p&gt;CDNs, DNS providers, WAFs, and cloud regions fail regionally all the time. A single probe gives you one geography's truth, not the internet's truth.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Global Latency
&lt;/h3&gt;

&lt;p&gt;Response time from Virginia tells you nothing about what users in Tokyo or Sydney are experiencing. If you only measure one region, your latency graph can look healthy while half your users are waiting 800ms.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Probe Failure
&lt;/h3&gt;

&lt;p&gt;If the only probe checking your service goes dark, you lose visibility. No data, no validation, no safety net.&lt;/p&gt;

&lt;p&gt;With multi-region monitoring, one failed probe reduces coverage. With single-region monitoring, one failed probe can eliminate it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost of the Wrong Model
&lt;/h2&gt;

&lt;p&gt;Here is what the tradeoff looks like in practice:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Single-region&lt;/th&gt;
&lt;th&gt;Multi-region consensus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;False positives per week (per critical monitor)&lt;/td&gt;
&lt;td&gt;7–10+&lt;/td&gt;
&lt;td&gt;Near zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engineering time spent investigating noise&lt;/td&gt;
&lt;td&gt;4–6 hrs/week&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regional outage visibility&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confidence in alerts&lt;/td&gt;
&lt;td&gt;Erodes over time&lt;/td&gt;
&lt;td&gt;Stays high&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 AM pages that turn out to be nothing&lt;/td&gt;
&lt;td&gt;Common&lt;/td&gt;
&lt;td&gt;Rare&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At $75/hour, five hours per week spent investigating false alerts adds up to &lt;strong&gt;$19,500 per year&lt;/strong&gt; in wasted engineering time.&lt;/p&gt;
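
&lt;p&gt;The figure is simple multiplication, so it is easy to rerun with your own rate and hours. The variable names here are illustrative, and 52 working weeks per year is an assumption.&lt;/p&gt;

```python
HOURLY_RATE = 75        # dollars per engineering hour
HOURS_PER_WEEK = 5      # time spent investigating false alerts
WEEKS_PER_YEAR = 52     # assumes no weeks off

annual_cost = HOURLY_RATE * HOURS_PER_WEEK * WEEKS_PER_YEAR
print(f"${annual_cost:,} per year")
```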

&lt;p&gt;That does not include the harder cost: once your team learns that alerts are noisy, response urgency drops. Then a real outage happens, and those extra five minutes of doubt become expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Good Monitoring Should Do Instead
&lt;/h2&gt;

&lt;p&gt;If you are evaluating your current setup, ask five questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;How many probe regions actively check each critical service?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Does one failed region trigger an alert, or is failure verified from other locations first?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Can you see per-region results clearly in the dashboard?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Can the system distinguish a regional issue from a global outage?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How many alerts in the last 30 days turned out to be nothing?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answer to the last question is anything above zero, there is a good chance your monitoring architecture is part of the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Vantaj Approaches It
&lt;/h2&gt;

&lt;p&gt;Vantaj uses multi-region consensus by default.&lt;/p&gt;

&lt;p&gt;When one region sees a failure, the system verifies from additional independent locations before opening an incident. If one region fails and the others succeed, it is treated as a path-level or regional issue rather than a service outage.&lt;/p&gt;

&lt;p&gt;That means the alert you get at 3 AM is much more likely to be real.&lt;/p&gt;

&lt;p&gt;And that is what a monitoring system is supposed to do: not tell you that &lt;em&gt;something somewhere&lt;/em&gt; went wrong, but tell you when &lt;strong&gt;your service is actually down.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Single-region monitoring was a reasonable compromise when monitoring infrastructure was expensive and internet paths were simpler.&lt;/p&gt;

&lt;p&gt;That is no longer the world we operate in.&lt;/p&gt;

&lt;p&gt;If your monitoring tool still treats one failed path as proof of downtime, it is optimizing for vendor simplicity, not for your reliability.&lt;/p&gt;

</description>
      <category>uptime</category>
      <category>monitoring</category>
      <category>website</category>
    </item>
  </channel>
</rss>
