<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Oleg Glybchenko</title>
    <description>The latest articles on DEV Community by Oleg Glybchenko (@lezhag).</description>
    <link>https://dev.to/lezhag</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3839054%2Fb3f15763-5206-4f30-9d85-990f16e7afaa.jpeg</url>
      <title>DEV Community: Oleg Glybchenko</title>
      <link>https://dev.to/lezhag</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lezhag"/>
    <language>en</language>
    <item>
      <title>I kept getting blamed for outages that weren't mine. So I built a tool to fight back</title>
      <dc:creator>Oleg Glybchenko</dc:creator>
      <pubDate>Mon, 23 Mar 2026 01:31:53 +0000</pubDate>
      <link>https://dev.to/lezhag/i-kept-getting-blamed-for-outages-that-werent-mine-so-i-built-a-tool-to-fight-back-4o43</link>
      <guid>https://dev.to/lezhag/i-kept-getting-blamed-for-outages-that-werent-mine-so-i-built-a-tool-to-fight-back-4o43</guid>
<description>&lt;h2&gt;The villain origin story&lt;/h2&gt;

&lt;p&gt;December 2024. 11 PM. I'm on the couch. Phone buzzes.&lt;/p&gt;

&lt;p&gt;"Hey, the AI feature is broken."&lt;/p&gt;

&lt;p&gt;I check our dashboards. Everything's green. Our servers are fine. Our database is fine. Our CDN is fine.&lt;/p&gt;

&lt;p&gt;OpenAI is down.&lt;/p&gt;

&lt;p&gt;Not us. OpenAI. Their status page? Still showing "All Systems Operational." It took them over an hour to even acknowledge it. By then I'd already gotten 14 messages from users who thought &lt;em&gt;we&lt;/em&gt; broke something.&lt;/p&gt;

&lt;p&gt;Two weeks later — same thing. OpenAI down again. 4+ hours. Same dance. Same blame.&lt;/p&gt;

&lt;p&gt;I started to notice a pattern.&lt;/p&gt;

&lt;h2&gt;The pattern&lt;/h2&gt;

&lt;p&gt;Every monitoring tool I've ever used — Pingdom, UptimeRobot, Datadog — answers the same question: &lt;strong&gt;is MY site up?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's... not the question anymore.&lt;/p&gt;

&lt;p&gt;Your site is up. Your checkout is up. Your auth is up. But Stripe's API is returning 500s, so your checkout silently fails. Twilio is lagging, so your 2FA codes arrive 30 minutes late. OpenAI is down, so every AI feature you've built returns a loading spinner forever.&lt;/p&gt;

&lt;p&gt;Your infrastructure is healthy. Your users are furious. And the status pages of those providers? They'll update eventually. Maybe. If they feel like it.&lt;/p&gt;

&lt;p&gt;→ You find out from customer complaints&lt;br&gt;
→ You find out from someone on Twitter&lt;br&gt;
→ You find out from Downdetector — which, by the way, &lt;a href="https://www.tomsguide.com/news/live/cloudfare-outage-november-2025-x-chatgpt" rel="noopener noreferrer"&gt;went down during the November 2025 Cloudflare outage&lt;/a&gt;. The site that tracks outages &lt;em&gt;couldn't track the outage.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I'm not making this up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fea2kastcmywuerqijwfs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fea2kastcmywuerqijwfs.png" alt="A chain of API dependency nodes with one cracked and glowing red, causing a ripple of warnings" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;It's getting worse&lt;/h2&gt;

&lt;p&gt;This isn't a pessimism play. It's math.&lt;/p&gt;

&lt;p&gt;Every major cloud provider is pouring money into GPU clusters and AI infrastructure. &lt;a href="https://www.forrester.com/blogs/predictions-2026-cloud-outages-private-ai-on-private-clouds-and-the-rise-of-the-neoclouds/" rel="noopener noreferrer"&gt;Forrester&lt;/a&gt; — not some blogger, an actual analyst firm willing to stake their reputation on it — predicted at least two major multi-day hyperscaler outages in 2026. The reason? Legacy systems getting neglected while everyone races to ship the next model. &lt;a href="https://www.techtarget.com/searchcloudcomputing/feature/Cloud-outages-expected-to-be-the-new-normal-in-2026" rel="noopener noreferrer"&gt;TechTarget covered it too&lt;/a&gt; — cloud outages are expected to be the new normal.&lt;/p&gt;

&lt;p&gt;And we're already seeing it. In the first three months of 2026 alone:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anthropic (March 17-18)&lt;/strong&gt; — Claude went down hard. API 500 errors everywhere. Chat, coding tools, mobile — all dead. &lt;a href="https://www.ibtimes.co.uk/anthropic-claude-ai-global-disruption-1786601" rel="noopener noreferrer"&gt;IBTimes reported&lt;/a&gt; on the global disruption. Developers on Twitter called it &lt;a href="https://www.techradar.com/news/live/claude-anthropic-down-outage-march-11-2026" rel="noopener noreferrer"&gt;"a snow day"&lt;/a&gt;. &lt;a href="https://statusgator.com/services/anthropic/claudeai" rel="noopener noreferrer"&gt;StatusGator tracked 9+ hours of downtime on March 18 alone&lt;/a&gt;. Earlier in February, StatusGator detected an Anthropic outage &lt;a href="https://statusgator.com/blog/february-2026-early-warning-signals/" rel="noopener noreferrer"&gt;1 hour and 47 minutes&lt;/a&gt; before Anthropic acknowledged it on their own status page. Almost two hours of "All Systems Operational" while nothing was operational.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub (February)&lt;/strong&gt; — &lt;a href="https://github.blog/news-insights/company-news/github-availability-report-february-2026/" rel="noopener noreferrer"&gt;Six separate incidents in one month&lt;/a&gt;. The worst one took out GitHub Actions and Codespaces for nearly six hours. Then on &lt;a href="https://www.githubstatus.com" rel="noopener noreferrer"&gt;March 18, webhook deliveries&lt;/a&gt; went from ~5 second latency to 160 seconds. That's a 32x spike. On the same day Cloudflare was also having issues. Fun times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare (March 18)&lt;/strong&gt; — &lt;a href="https://www.cloudflarestatus.com/" rel="noopener noreferrer"&gt;Pages API, Pages Builds, and Workers Builds&lt;/a&gt; all hit with 500s and build failures. If you were deploying that day, you weren't deploying that day. This came after the &lt;a href="https://linuxblog.io/cloudflare-outage-nov-18-2025/" rel="noopener noreferrer"&gt;November 2025 mega-outage&lt;/a&gt; that took out X, Spotify, ChatGPT, Canva, Discord, and dozens more — and a &lt;a href="https://www.datamation.com/networks/cloudflare-outage-dec-2025/" rel="noopener noreferrer"&gt;second outage in December&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stripe&lt;/strong&gt; — &lt;a href="https://statusgator.com/services/stripe" rel="noopener noreferrer"&gt;13 incidents in the last 90 days. Three major outages.&lt;/a&gt; The payment processor. Thirteen incidents in three months.&lt;/p&gt;

&lt;p&gt;AI coding agents are &lt;a href="https://blog.bytebytego.com/p/how-stripes-minions-ship-1300-prs" rel="noopener noreferrer"&gt;shipping 1,300 PRs a week at Stripe&lt;/a&gt; now. That's incredible. It's also 1,300 chances per week for something subtle to break. The faster everyone ships, the more things break. The more things break, the more you need to know about it before your users do.&lt;/p&gt;

&lt;p&gt;I'm not anti-AI — I literally built CanaryOps using Claude Code. But the tool that helps me build faster is also the tool that &lt;a href="https://status.claude.com/" rel="noopener noreferrer"&gt;goes down at 11 PM on a Tuesday&lt;/a&gt;. The irony isn't lost on me.&lt;/p&gt;

&lt;h2&gt;The "I'll just build it" moment&lt;/h2&gt;

&lt;p&gt;Every developer has this moment. The sensible voice in your head says "there's probably a tool for this." You google it. There isn't. Well, there is — but it costs $300/month and requires a sales call and a 47-slide onboarding deck.&lt;/p&gt;

&lt;p&gt;So you do what developers do.&lt;/p&gt;

&lt;p&gt;You open your terminal at 11 PM on a Tuesday and whisper "how hard can it be."&lt;/p&gt;

&lt;p&gt;(Narrator: it was not that hard.)&lt;/p&gt;

&lt;h2&gt;What I built&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CanaryOps&lt;/strong&gt; — monitors the APIs you depend on. Not your site. Your &lt;em&gt;dependencies&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The idea is dead simple: ping Stripe, OpenAI, Twilio, Cloudflare, SendGrid, and about 10 other common APIs at regular intervals. When something starts returning errors or degrading — email alert. Before the status page updates. Before the user complains. Before you find out from a meme on Twitter.&lt;/p&gt;
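&lt;p&gt;A minimal sketch of that check loop, under stated assumptions: the 3-second latency cutoff, the 5-second timeout, and the idea of a pure "classify" step are illustrative choices for this post, not CanaryOps' actual code:&lt;/p&gt;

```typescript
type CheckResult = { healthy: boolean; status: number; latencyMs: number };

// Pure classification: any non-2xx status, or a round trip slower than
// the cutoff, counts as degraded. Status 0 means the request itself failed.
function classify(status: number, latencyMs: number): CheckResult {
  const ok = Math.floor(status / 100) === 2;
  if (!ok) return { healthy: false, status, latencyMs };
  return { healthy: !(latencyMs > 3000), status, latencyMs };
}

// Ping one dependency and time the round trip; DNS failures and timeouts
// surface as a thrown fetch error, which we map to status 0.
async function checkDependency(url: string, timeoutMs = 5000) {
  const started = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return classify(res.status, Date.now() - started);
  } catch {
    return classify(0, Date.now() - started);
  }
}
```

&lt;p&gt;Keeping the healthy/degraded decision in a pure function means the interesting logic is trivial to unit-test, while the network call stays in a thin wrapper.&lt;/p&gt;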

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Next.js (App Router) + TypeScript&lt;/li&gt;
&lt;li&gt;Supabase for auth and database&lt;/li&gt;
&lt;li&gt;Resend for email alerts&lt;/li&gt;
&lt;li&gt;Vercel cron for the check engine&lt;/li&gt;
&lt;li&gt;Lemon Squeezy for billing (because Stripe wanted me to incorporate a company just to accept payments, which is a whole different rant)&lt;/li&gt;
&lt;/ul&gt;
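&lt;p&gt;The Vercel cron piece of that stack might look roughly like this; the route path, schedule, and use of &lt;code&gt;CRON_SECRET&lt;/code&gt; follow Vercel's documented cron pattern, but the specifics are assumptions for illustration, not CanaryOps' actual setup:&lt;/p&gt;

```typescript
// Sketch of a cron-driven check endpoint as a Next.js App Router route
// handler (e.g. app/api/cron/check/route.ts -- the path is an assumption).
// The matching vercel.json entry would be something like:
//   { "crons": [{ "path": "/api/cron/check", "schedule": "*/5 * * * *" }] }

// When a CRON_SECRET env var is set, Vercel invokes cron routes with an
// "Authorization: Bearer ..." header; rejecting everything else keeps
// strangers from triggering the checks by hand.
export async function GET(request: Request) {
  const auth = request.headers.get("authorization");
  if (auth !== `Bearer ${process.env.CRON_SECRET}`) {
    return new Response("Unauthorized", { status: 401 });
  }
  // ...run the dependency checks and fire alerts here...
  return Response.json({ ok: true });
}
```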

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbw0nzo9zqiqhwyu1o9ud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbw0nzo9zqiqhwyu1o9ud.png" alt="CanaryOps dashboard" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign up, pick the APIs you use from a preset list (or add any custom URL)&lt;/li&gt;
&lt;li&gt;CanaryOps pings them at regular intervals&lt;/li&gt;
&lt;li&gt;Something's off → you get an email&lt;/li&gt;
&lt;li&gt;You now have a 5-minute head start over your users&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. No Kubernetes. No Grafana dashboards that require a PhD to read. No 47-slide onboarding deck.&lt;/p&gt;

&lt;h2&gt;Why email and not Slack/Discord/SMS&lt;/h2&gt;

&lt;p&gt;Because email is the cockroach of the internet. SMTP has been running since 1982. It survived the dot-com bust, the move to mobile, the move to cloud, and every Cloudflare outage that takes out half the web. Your inbox still works when everything else is on fire.&lt;/p&gt;

&lt;p&gt;Yes, I use Resend to send the alerts. Yes, Resend could also go down. Yes, I'm aware of the irony of monitoring dependencies while depending on dependencies.&lt;/p&gt;

&lt;p&gt;It's dependencies all the way down.&lt;/p&gt;

&lt;p&gt;At some point you pick a layer you trust and stand on it. I'm standing on email. It's been working for 43 years. I like those odds.&lt;/p&gt;
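&lt;p&gt;For what it's worth, the alert itself is just one HTTP call to Resend. A hedged sketch: the endpoint and field names follow Resend's public REST API, while the addresses and the monitor shape are illustrative:&lt;/p&gt;

```typescript
type Alert = { monitor: string; status: number; latencyMs: number };

// Pure payload builder, so the email content is testable without a network.
// The from/to addresses here are placeholders, not real CanaryOps addresses.
function buildAlertPayload(alert: Alert) {
  return {
    from: "alerts@example.dev",
    to: ["you@example.dev"],
    subject: `[CanaryOps] ${alert.monitor} is degraded`,
    text: `${alert.monitor} returned HTTP ${alert.status} in ${alert.latencyMs}ms.`,
  };
}

// Resend's "send email" endpoint: POST /emails with a bearer API key.
async function sendAlert(alert: Alert) {
  await fetch("https://api.resend.com/emails", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.RESEND_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildAlertPayload(alert)),
  });
}
```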

&lt;h2&gt;The part where I ask for your help&lt;/h2&gt;

&lt;p&gt;CanaryOps is live. Free tier gives you 10 monitors with 5-minute checks and email alerts. No credit card, no trial that expires, no "contact sales."&lt;/p&gt;

&lt;p&gt;Pro is $9/month for 50 monitors and 1-minute checks. That's less than the mass-produced sandwich you bought for lunch.&lt;/p&gt;

&lt;p&gt;I'm a solo dev. No funding, no team, no marketing department (as you can probably tell from this post). I built this because I needed it, and I'm betting other developers need it too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I actually want from you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Try it → &lt;a href="https://canaryops.dev" rel="noopener noreferrer"&gt;canaryops.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Tell me what's broken&lt;/li&gt;
&lt;li&gt;Tell me what APIs to add to the preset list&lt;/li&gt;
&lt;li&gt;Tell me if the value prop makes sense or if I'm yelling into the void&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The thesis&lt;/h2&gt;

&lt;p&gt;The question isn't whether your dependencies will go down. They will. &lt;a href="https://www.forrester.com/blogs/predictions-2026-cloud-outages-private-ai-on-private-clouds-and-the-rise-of-the-neoclouds/" rel="noopener noreferrer"&gt;More often than last year, and the year before that.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The question is whether you'll know before your Slack lights up with user complaints.&lt;/p&gt;

&lt;p&gt;Status pages are self-reported. They update when providers decide to admit something's wrong. That's not monitoring — that's PR.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uuf83wra1hpikty8xt8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uuf83wra1hpikty8xt8.png" alt="" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You deserve to know before the status page does.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://canaryops.dev" rel="noopener noreferrer"&gt;canaryops.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>saas</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
