<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: bishwas jha</title>
    <description>The latest articles on DEV Community by bishwas jha (@alphacrack).</description>
    <link>https://dev.to/alphacrack</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2629801%2F4f499ed0-1048-45f5-9588-0b27ee608adc.png</url>
      <title>DEV Community: bishwas jha</title>
      <link>https://dev.to/alphacrack</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alphacrack"/>
    <language>en</language>
    <item>
      <title>5 things Railway’s 8 hour outage should change about how you think about redundancy</title>
      <dc:creator>bishwas jha</dc:creator>
      <pubDate>Fri, 22 May 2026 08:03:03 +0000</pubDate>
      <link>https://dev.to/alphacrack/5-things-railways-8-hour-outage-should-change-about-how-you-think-about-redundancy-1k5l</link>
      <guid>https://dev.to/alphacrack/5-things-railways-8-hour-outage-should-change-about-how-you-think-about-redundancy-1k5l</guid>
      <description>&lt;p&gt;Railway runs on Google Cloud, AWS, and its own metal.&lt;/p&gt;

&lt;p&gt;So when I saw that Railway was down for hours, my first thought was probably the same as yours.&lt;/p&gt;

&lt;p&gt;"How does a multi cloud platform go dark like that?"&lt;/p&gt;

&lt;p&gt;Then I read the incident report, the Hacker News discussion, and the follow up coverage. And the real lesson is uncomfortable.&lt;/p&gt;

&lt;p&gt;This was not really a cloud outage.&lt;/p&gt;

&lt;p&gt;The servers did not all die. AWS did not die. Railway Metal did not die. Google Cloud infrastructure itself did not have to collapse.&lt;/p&gt;

&lt;p&gt;What failed was much higher up the stack.&lt;/p&gt;

&lt;p&gt;The account.&lt;/p&gt;

&lt;p&gt;Google Cloud incorrectly suspended Railway's production account as part of an automated action. Railway says this happened around 22:20 UTC on May 19, and the platform was not fully recovered until the next morning. (&lt;a href="https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage" rel="noopener noreferrer"&gt;https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;That should make every CloudOps, platform, SRE, and engineering leader stop for a minute.&lt;/p&gt;

&lt;p&gt;Because most redundancy plans are built for the wrong failure.&lt;/p&gt;

&lt;p&gt;We design for dead VMs.&lt;br&gt;
We design for unavailable zones.&lt;br&gt;
We design for regional failover.&lt;br&gt;
We design for database replicas.&lt;/p&gt;

&lt;p&gt;But what do we do when the provider says, incorrectly or automatically, “your account is no longer allowed to exist normally”?&lt;/p&gt;

&lt;p&gt;Not much, usually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. This was not a cloud outage. It was an account suspension&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is the first big lesson.&lt;/p&gt;

&lt;p&gt;A lot of people hear "cloud outage" and instantly think of regions, zones, load balancers, or broken hardware. But Railway’s case was different.&lt;/p&gt;

&lt;p&gt;Google Cloud's automated systems suspended Railway's production account. Railway says this was incorrect, and that the action was part of a wider automated event affecting many accounts. (&lt;a href="https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage" rel="noopener noreferrer"&gt;https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;That kind of failure does not look like a server going unhealthy.&lt;/p&gt;

&lt;p&gt;It looks like identity, billing, trust, abuse detection, policy, support, and account control all becoming part of your availability story.&lt;/p&gt;

&lt;p&gt;Your health checks can say everything is fine.&lt;/p&gt;

&lt;p&gt;Your multi zone architecture can be green.&lt;/p&gt;

&lt;p&gt;Your workloads can still technically exist.&lt;/p&gt;

&lt;p&gt;But if the account is restricted, your beautiful infrastructure diagram does not matter much.&lt;/p&gt;

&lt;p&gt;This is the part many teams do not model.&lt;/p&gt;

&lt;p&gt;They model "what if eu-west-1 is down?"&lt;/p&gt;

&lt;p&gt;They rarely model "what if our production cloud account is frozen by an automated system at 11 PM?"&lt;/p&gt;

&lt;p&gt;And honestly, that second one is scarier.&lt;/p&gt;

&lt;p&gt;Because you do not debug it with kubectl.&lt;/p&gt;

&lt;p&gt;You debug it with support tickets, escalation paths, account managers, legal trust, and luck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The control plane was the real single point of failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Railway had workloads on AWS and Railway Metal that were still running during the incident. But users still saw errors.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because the routing control plane was hosted on Google Cloud.&lt;/p&gt;

&lt;p&gt;Railway's edge proxies needed that control plane to know where workloads lived. They had cached route data for a while, but once the cache expired, the edge could not keep routing properly. Railway's community update said route cache expiry caused the incident to spread beyond GCP hosted workloads and affect the wider platform. (&lt;a href="https://station.railway.com/community/what-we-know-so-far-may-19th-2026-86354cdd" rel="noopener noreferrer"&gt;https://station.railway.com/community/what-we-know-so-far-may-19th-2026-86354cdd&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;This is the second lesson.&lt;/p&gt;

&lt;p&gt;Your data plane can be redundant while your control plane is still fragile.&lt;/p&gt;

&lt;p&gt;And this is where a lot of "multi cloud" thinking becomes a little fake.&lt;/p&gt;

&lt;p&gt;You can run compute in three places.&lt;br&gt;
You can run storage in two places.&lt;br&gt;
You can have Kubernetes clusters everywhere.&lt;/p&gt;

&lt;p&gt;But if the scheduler, routing map, identity service, deployment API, config database, or certificate automation lives in one provider, your multi cloud story may only be multi cloud on paper.&lt;/p&gt;

&lt;p&gt;The thing customers see as "the product" is often not the workload.&lt;/p&gt;

&lt;p&gt;It is the control plane around the workload.&lt;/p&gt;

&lt;p&gt;For Railway, customers were not just buying raw compute. They were buying routing, builds, deployments, dashboard access, APIs, orchestration and platform magic.&lt;/p&gt;

&lt;p&gt;And the platform magic had a dependency.&lt;/p&gt;

&lt;p&gt;That dependency became the outage.&lt;/p&gt;
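
&lt;p&gt;Here is a minimal sketch of one way an edge proxy could soften the route cache expiry described above: keep the last known route and serve it, flagged as stale, when the control plane cannot be reached. This is not Railway's implementation. The &lt;code&gt;RouteCache&lt;/code&gt; class, the &lt;code&gt;fetch_route&lt;/code&gt; callback, the 30 second TTL, and the addresses are all made up for illustration.&lt;/p&gt;

```python
import time


class RouteCache:
    """Caches control plane answers and serves stale ones when refresh fails."""

    def __init__(self, fetch_route, ttl_seconds, clock=time.monotonic):
        self._fetch_route = fetch_route  # hypothetical control plane lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}               # hostname -> (backend, fetched_at)

    def lookup(self, hostname):
        """Return (backend, is_stale); raise LookupError if nothing is known."""
        now = self._clock()
        cached = self._entries.get(hostname)
        is_fresh = cached is not None and self._ttl >= now - cached[1]
        if is_fresh:
            return cached[0], False
        try:
            backend = self._fetch_route(hostname)
        except ConnectionError:
            if cached is None:
                raise LookupError(f"no route known for {hostname}")
            # Control plane unreachable: keep routing on the last known answer.
            return cached[0], True
        self._entries[hostname] = (backend, now)
        return backend, False


def demo():
    state = {"up": True, "now": 0.0}

    def fetch_route(hostname):
        if not state["up"]:
            raise ConnectionError("control plane unreachable")
        return "10.0.0.7:8080"

    cache = RouteCache(fetch_route, ttl_seconds=30, clock=lambda: state["now"])
    first = cache.lookup("app.example.com")   # fetched from the control plane
    state["up"] = False
    state["now"] = 120.0                      # well past the TTL
    second = cache.lookup("app.example.com")  # served from the stale entry
    return first, second


print(demo())
```

&lt;p&gt;The trade off is real: a stale route can point at a workload that has moved. That is why the sketch returns the stale flag instead of hiding it, so the caller can decide how long stale answers are acceptable.&lt;/p&gt;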

&lt;p&gt;&lt;strong&gt;3. Getting the account back is not the same as getting the service back&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This one is very important.&lt;/p&gt;

&lt;p&gt;According to Railway, Google reversed the suspension shortly after escalation. But recovery still took hours because account restoration did not automatically bring everything back cleanly. Persistent disks, compute instances, networking and orchestration layers had to be restored and verified step by step. (&lt;a href="https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage" rel="noopener noreferrer"&gt;https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;This is the part people underestimate.&lt;/p&gt;

&lt;p&gt;A provider can say, “access restored.”&lt;/p&gt;

&lt;p&gt;But your system still has to wake up.&lt;/p&gt;

&lt;p&gt;Disks need to attach.&lt;br&gt;
Networks need to behave.&lt;br&gt;
Queues need to drain.&lt;br&gt;
Deployments need to stop stampeding.&lt;br&gt;
Databases need to agree again.&lt;br&gt;
Caches need to be repopulated.&lt;br&gt;
Humans need to verify what is safe.&lt;/p&gt;

&lt;p&gt;That is not instant.&lt;/p&gt;

&lt;p&gt;And in a complex platform, bringing things back too fast can be worse than bringing them back slowly.&lt;/p&gt;

&lt;p&gt;Railway also throttled queued deploys during recovery, which sounds boring, but it is actually the responsible move. Because after an outage, your own backlog becomes traffic. And that traffic can flatten the recovering system.&lt;/p&gt;

&lt;p&gt;So the real RTO is not:&lt;/p&gt;

&lt;p&gt;"How fast can the provider undo the mistake?"&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;p&gt;"How fast can we safely restore the whole chain after the provider undo the mistake?"&lt;/p&gt;

&lt;p&gt;Small difference in words.&lt;/p&gt;

&lt;p&gt;Huge difference in reality.&lt;/p&gt;
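
&lt;p&gt;The idea of throttling queued deploys during recovery can be sketched with a token bucket. This is only an illustration under my own assumptions, not how Railway does it. The &lt;code&gt;TokenBucket&lt;/code&gt; class, the &lt;code&gt;drain&lt;/code&gt; helper, and the rate and capacity numbers are hypothetical.&lt;/p&gt;

```python
import time
from collections import deque


class TokenBucket:
    """Allows about `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self):
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        refill = (now - self._last) * self.rate
        self._tokens = min(self.capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def drain(queue, bucket, start):
    """Start as many queued deploys as the bucket allows right now."""
    started = 0
    while queue and bucket.try_acquire():
        start(queue.popleft())
        started += 1
    return started


def demo():
    now = [0.0]
    bucket = TokenBucket(rate=2, capacity=5, clock=lambda: now[0])
    backlog = deque(f"deploy-{i}" for i in range(20))
    started = []
    first_tick = drain(backlog, bucket, started.append)   # initial burst
    now[0] = 1.0                                          # one second later
    second_tick = drain(backlog, bucket, started.append)  # refilled tokens only
    return first_tick, second_tick, len(backlog)


print(demo())
```

&lt;p&gt;With these numbers, a backlog of 20 deploys starts 5 immediately, 2 more after one second, and leaves 13 waiting. The backlog still drains, just at a pace the recovering system can absorb.&lt;/p&gt;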

&lt;p&gt;&lt;strong&gt;4. Recovery can create a second outage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is probably my favorite lesson from the whole incident, because it is so real.&lt;/p&gt;

&lt;p&gt;When Railway started recovering, queued retries and user activity came back in a burst. That burst hit GitHub OAuth and webhook flows hard enough that GitHub rate limited Railway. So logins and builds had problems again, even after the original Google Cloud issue was no longer the main blocker. (&lt;a href="https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage" rel="noopener noreferrer"&gt;https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;That is painful.&lt;/p&gt;

&lt;p&gt;The first outage came from one provider.&lt;/p&gt;

&lt;p&gt;The second problem appeared during recovery, from another dependency.&lt;/p&gt;

&lt;p&gt;This happens more often than teams admit.&lt;/p&gt;

&lt;p&gt;After an outage, everything tries to catch up.&lt;/p&gt;

&lt;p&gt;Cron jobs wake up.&lt;br&gt;
Webhooks retry.&lt;br&gt;
CI pipelines restart.&lt;br&gt;
Users refresh dashboards.&lt;br&gt;
Workers pull old messages.&lt;br&gt;
Integrations suddenly see a wall of traffic.&lt;/p&gt;

&lt;p&gt;And then some other system says, “this looks abusive.”&lt;/p&gt;

&lt;p&gt;Now your recovery has become its own incident.&lt;/p&gt;

&lt;p&gt;This is why serious resilience is not just failover.&lt;/p&gt;

&lt;p&gt;It is controlled recovery.&lt;/p&gt;

&lt;p&gt;Backpressure matters.&lt;br&gt;
Retry budgets matter.&lt;br&gt;
Queue draining matters.&lt;br&gt;
Circuit breakers matter.&lt;br&gt;
Rate limit awareness matters.&lt;br&gt;
Runbooks matter.&lt;/p&gt;

&lt;p&gt;And boring old institutional memory matters even more.&lt;/p&gt;

&lt;p&gt;Railway had already hardened parts of the GitHub rate limit path after a prior incident, which helped reduce damage this time. That is not luck. That is the value of learning properly from past pain.&lt;/p&gt;
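
&lt;p&gt;Two of the ideas listed above, retry budgets and spread out retries, fit in a short sketch. It assumes a client that talks to some rate limited upstream. &lt;code&gt;RetryBudget&lt;/code&gt;, &lt;code&gt;backoff_delay&lt;/code&gt;, &lt;code&gt;call_with_budget&lt;/code&gt;, and the 0.25 ratio are hypothetical names and values, not anything from Railway or GitHub.&lt;/p&gt;

```python
import random
import time


class RetryBudget:
    """Permits retries only while they stay under a fixed ratio of attempts."""

    def __init__(self, ratio):
        self.ratio = ratio
        self.attempts = 0
        self.retries = 0

    def record_attempt(self):
        self.attempts += 1

    def can_retry(self):
        # Spend from the budget only if one more retry stays within the ratio.
        if self.attempts * self.ratio >= self.retries + 1:
            self.retries += 1
            return True
        return False


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Exponential backoff with full jitter, capped at `cap` seconds."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_budget(operation, budget, max_retries=3, sleep=time.sleep):
    """Run `operation`, retrying on ConnectionError while the budget allows."""
    budget.record_attempt()
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries or not budget.can_retry():
                raise
            sleep(backoff_delay(attempt))


def demo():
    calls = []

    def always_failing():
        calls.append(1)
        raise ConnectionError("upstream rate limited us")

    budget = RetryBudget(ratio=0.25)
    for _ in range(8):
        try:
            call_with_budget(always_failing, budget, sleep=lambda seconds: None)
        except ConnectionError:
            pass
    return len(calls), budget.retries


print(demo())
```

&lt;p&gt;In the demo, 8 failing requests produce only 2 retries, so the upstream sees 10 calls instead of the 32 that unconditional retries would send. The jitter keeps the retries that do happen from landing at the same instant.&lt;/p&gt;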

&lt;p&gt;&lt;strong&gt;5. Most teams insure the wrong half of the risk&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Railway incident is not the first time account level cloud risk became real.&lt;/p&gt;

&lt;p&gt;In 2024, UniSuper, a major Australian pension fund, had a serious Google Cloud incident where its private cloud environment was deleted because of a misconfiguration. Google later published details saying backups in Google Cloud Storage and third party backup software helped restoration. (&lt;a href="https://cloud.google.com/blog/products/infrastructure/details-of-google-cloud-gcve-incident" rel="noopener noreferrer"&gt;https://cloud.google.com/blog/products/infrastructure/details-of-google-cloud-gcve-incident&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;So no, account level and provider control plane risk is not some imaginary edge case.&lt;/p&gt;

&lt;p&gt;It happens.&lt;/p&gt;

&lt;p&gt;But most companies still talk about redundancy like this:&lt;/p&gt;

&lt;p&gt;"We use multiple clouds."&lt;/p&gt;

&lt;p&gt;OK, but what does that mean?&lt;/p&gt;

&lt;p&gt;Does it mean workloads can run somewhere else?&lt;/p&gt;

&lt;p&gt;Or does it mean you can actually operate the business if one provider account disappears?&lt;/p&gt;

&lt;p&gt;Those are very different things.&lt;/p&gt;

&lt;p&gt;Flexera's 2026 State of the Cloud report, based on a survey of 753 cloud decision makers, shows multi cloud is still a major enterprise pattern. (&lt;a href="https://info.flexera.com/CM-REPORT-State-of-the-Cloud?lead_source=Organic+Search" rel="noopener noreferrer"&gt;https://info.flexera.com/CM-REPORT-State-of-the-Cloud?lead_source=Organic+Search&lt;/a&gt;) But in practice, many companies are multi cloud for procurement, politics, analytics, or workload placement.&lt;/p&gt;

&lt;p&gt;Not always for true survivability.&lt;/p&gt;

&lt;p&gt;True survivability asks much harder questions.&lt;/p&gt;

&lt;p&gt;Can we deploy without this provider?&lt;br&gt;
Can we route without this provider?&lt;br&gt;
Can we authenticate without this provider?&lt;br&gt;
Can we restore backups without this provider?&lt;br&gt;
Can we contact support fast enough?&lt;br&gt;
Can we prove ownership if an automated trust system flags us?&lt;br&gt;
Can we keep serving read only traffic if the control plane dies?&lt;br&gt;
Can we rebuild from another account, another org, or another provider?&lt;/p&gt;

&lt;p&gt;That is not as sexy as "active active multi cloud."&lt;/p&gt;

&lt;p&gt;But it is probably more useful.&lt;/p&gt;
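
&lt;p&gt;The questions above can be turned into a tiny audit. This sketch assumes you keep an inventory of which providers can each serve a capability on their own. The inventory, the provider names, and the &lt;code&gt;lost_without&lt;/code&gt; function are invented for illustration and do not describe any real platform.&lt;/p&gt;

```python
# Hypothetical inventory: capability -> providers that can each serve it alone.
CAPABILITIES = {
    "run workloads": {"cloud-a", "cloud-b", "own-metal"},
    "route traffic": {"cloud-a"},
    "deploy": {"cloud-a"},
    "authenticate users": {"idp-x"},
    "restore backups": {"cloud-a", "cloud-b"},
}


def lost_without(provider, capabilities=CAPABILITIES):
    """List capabilities with no remaining provider once `provider` is gone."""
    return sorted(
        name
        for name, providers in capabilities.items()
        if not providers - {provider}
    )


for provider in ("cloud-a", "cloud-b", "idp-x"):
    print(provider, lost_without(provider))
```

&lt;p&gt;In this made up inventory, losing &lt;code&gt;cloud-b&lt;/code&gt; costs nothing, while losing &lt;code&gt;cloud-a&lt;/code&gt; takes out deploys and routing even though workloads could run elsewhere. The audit is only as honest as the inventory, so the hard part is writing down the dependencies nobody thinks about.&lt;/p&gt;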

&lt;p&gt;&lt;strong&gt;The real takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Railway did have redundancy.&lt;/p&gt;

&lt;p&gt;Just not for the layer that failed.&lt;/p&gt;

&lt;p&gt;And that is the uncomfortable lesson for the rest of us.&lt;/p&gt;

&lt;p&gt;Redundancy at the compute layer does not protect you from account suspension.&lt;/p&gt;

&lt;p&gt;Multi region databases do not protect you from provider level identity actions.&lt;/p&gt;

&lt;p&gt;Healthy servers do not help when routing control planes cannot tell traffic where to go.&lt;/p&gt;

&lt;p&gt;And getting your cloud account back does not mean your service is back.&lt;/p&gt;

&lt;p&gt;The next resilience review should not only ask:&lt;/p&gt;

&lt;p&gt;"What happens if a region dies?"&lt;/p&gt;

&lt;p&gt;It should also ask:&lt;/p&gt;

&lt;p&gt;"What happens if our cloud provider suspends our production account by mistake tonight?"&lt;/p&gt;

</description>
      <category>aws</category>
      <category>runway</category>
      <category>architecture</category>
      <category>gcp</category>
    </item>
  </channel>
</rss>
