Railway went down for 10 hours, and it wasn't their fault. Here's the part nobody is talking about.

#devops #cloud #sre #postmortem

22:20 UTC. May 19, 2026.

Railway's monitoring starts screaming.

Dashboard, 503. API, dead. Logins, failing. Within nine minutes, the on-call engineers have an answer, and honestly, it's almost worse than an outage:

Google Cloud suspended Railway's entire production account.

No warning. No email. No phone call. Just an automated enforcement action that flipped a switch on a company that spends over ten million dollars a year with them.

I put together a short breakdown of what actually happened, and walked through how we'd have spotted this kind of single-point-of-failure on the architecture canvas with Blast Radius before it bit. If you want the visual version of this post, it's here:

If you've been on the internet long enough, you've seen this movie before. UniSuper, a $50B pension fund, had its private cloud subscription accidentally deleted by GCP in 2024. Plenty of indie devs get auto-banned by AWS and Google with zero recourse. The Railway one just happens to be the biggest "developer cloud gets locked out of the cloud" event so far.

But the part that got under my skin wasn't the suspension. It was what happened next.

Railway isn't even fully on GCP

Here's what makes this incident actually interesting for anyone running infra.

Railway runs workloads on three things:

Their own bare metal hardware (Railway Metal)
AWS
GCP

Smart. Multi-cloud. Exactly what every architecture deck on LinkedIn tells you to do.

But their network control plane, the thing that knows where everything lives and how to route traffic, was hosted on GCP. So when GCP suspended the account at 22:20 UTC:

22:20: control plane goes down
22:35: the routing cache at the edge starts expiring
~23:35: Railway Metal workloads start returning 404s
shortly after: AWS workloads do the same

By the time the routing cache fully expired, every single workload across every cloud was unreachable. Even the ones running on hardware that Railway owns outright, sitting in its own racks, completely untouched by Google's enforcement action.

The servers were fine. The applications were fine. Nobody could find them.

That's the blast radius of one upstream click.
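
To make the failure mode concrete, here is a toy model of an edge route cache with a TTL. It is a sketch under assumptions, not Railway's code: the class names, the one-hour TTL, and the addresses are all made up for illustration.

```python
from dataclasses import dataclass, field


class ControlPlane:
    """Hypothetical stand-in for a routing control plane API."""

    def __init__(self, routes):
        self.routes = routes
        self.suspended = False

    def list_routes(self):
        if self.suspended:
            raise ConnectionError("control plane account suspended")
        return dict(self.routes)


@dataclass
class EdgeRouteCache:
    """Toy edge cache: routes are served until their TTL runs out."""

    ttl_seconds: int
    routes: dict = field(default_factory=dict)  # workload -> (address, cached_at)

    def refresh(self, control_plane, now):
        # Repopulate from the control plane; raises if it is unreachable.
        for workload, address in control_plane.list_routes().items():
            self.routes[workload] = (address, now)

    def resolve(self, workload, control_plane, now):
        entry = self.routes.get(workload)
        if entry and now - entry[1] < self.ttl_seconds:
            return entry[0]  # still fresh, control plane not needed
        try:
            self.refresh(control_plane, now)
        except ConnectionError:
            return None  # the workload is healthy, but nobody can find it
        entry = self.routes.get(workload)
        return entry[0] if entry else None


cp = ControlPlane({"metal-app": "10.0.0.5", "aws-app": "10.1.0.9"})
cache = EdgeRouteCache(ttl_seconds=3600)  # assumed one-hour TTL
cache.refresh(cp, now=0)

cp.suspended = True  # the upstream account disappears
before_expiry = cache.resolve("metal-app", cp, now=1800)
after_expiry = cache.resolve("metal-app", cp, now=3700)
```

Half an hour into the suspension the cached route still resolves. Once the TTL passes, the same lookup comes back empty even though the workload itself never stopped running.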

From Railway's own postmortem, which is unusually honest and worth reading in full:

"In this ring, there was still a hard dependency on workload discoverability being tied to the network control plane API that was hosted on the machines running in Google Cloud. This meant that despite the mesh continuing to operate for an hour, when the route cache expired, the mesh failed to re-populate the routing tables."

And from Angelo on the Railway forum:

"This one was egregiously bad because it was a single and expected point of failure like a cloud account getting removed… to say we are livid is an understatement."

Account access came back in 9 minutes after a P0 escalation. But the customer-facing outage ran nearly 10 hours total, because once the edges have forgotten where everything lives, you have to wake up disks, restart compute, rebuild routes, and re-converge the mesh layer by layer. The technical recovery alone took the better part of 8 hours after access was restored. And then GitHub starts rate-limiting your OAuth during the recovery surge, because of course it does.

The thing every engineer felt reading this

If you write infrastructure for a living, you read the Railway postmortem with one specific feeling, and it's not schadenfreude.

It's the cold realization that you don't actually know what depends on what in your own stack.

You know the obvious stuff. The big "if RDS goes down, the app goes down" connections. But the long tail? The security group that's quietly shared between four services? The Lambda that's the only thing keeping a webhook alive? The "idle" read replica that's actually the cross-region failover for orders?

You don't know. I don't know. Nobody knows until the thing breaks.

This is the gap nobody fills. Cost tools are great at telling you what's wasted. Observability tools are great at telling you what's broken right now. Neither one tells you what will break if you touch this.

So teams do one of two things:

Delete it and pray.
Don't delete it, and sit on thousands of dollars of monthly waste because nobody wants to be the person who broke checkout on a Friday.

That second one is the actual norm, by the way. Talk to any cloud cost lead, and they'll tell you the bottleneck isn't finding the savings. It's getting anyone to confidently apply the recommendations.

What we built (and why Railway's story is exactly the use case)

This is the gap we've been building Blast Radius for in ZopNight.

The idea is dead simple. Before you apply any recommendation (delete this, stop that, modify the other thing), Blast Radius lights up the architecture canvas and shows you, in plain language, what's actually connected and what's about to break. (If you watched the video above, you've already seen the canvas in action: the RDS read replica that looked idle but was actually the cross-region failover for production orders. That's the kind of thing this catches.)

Here's how it works under the hood:

Adjacency graph. We build it from the metadata you already have: shared security groups, load balancer targets, volume attachments, Lambda triggers, and parent resources. No agents. No flow logs. Just the same metadata you can see in the AWS / GCP / Azure consoles, stitched into a real graph.
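
For a feel of what "stitched into a real graph" can mean, here is a minimal sketch. It is not ZopNight's implementation: the inventory records, field names, and the hop limit are hypothetical, and edges are treated as undirected for simplicity.

```python
from collections import defaultdict, deque

# Hypothetical inventory, shaped like metadata you could read from a cloud console.
resources = [
    {"id": "rds-primary", "security_groups": ["sg-data"]},
    {"id": "rds-replica", "security_groups": ["sg-data"], "parent": "rds-primary"},
    {"id": "orders-api", "security_groups": ["sg-data"]},
    {"id": "alb-1", "targets": ["orders-api"]},
    {"id": "webhook-fn", "triggers": ["queue-1"]},
]


def build_graph(resources):
    """Link resources that share metadata: groups, targets, volumes, triggers, parents."""
    graph = defaultdict(set)
    by_group = defaultdict(list)
    for res in resources:
        graph[res["id"]]  # make sure isolated resources still appear
        linked = []
        for key in ("targets", "volumes", "triggers"):
            linked.extend(res.get(key, []))
        if res.get("parent"):
            linked.append(res["parent"])
        for other in linked:
            graph[res["id"]].add(other)
            graph[other].add(res["id"])
        for group in res.get("security_groups", []):
            by_group[group].append(res["id"])
    # Every pair of resources sharing a security group gets an edge.
    for members in by_group.values():
        for a in members:
            for b in members:
                if a != b:
                    graph[a].add(b)
    return graph


def blast_radius(graph, start, max_hops=2):
    """Breadth-first walk: everything within max_hops of the resource being touched."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue
        for neighbor in graph[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    del hops[start]
    return hops


graph = build_graph(resources)
direct = blast_radius(graph, "rds-replica", max_hops=1)
wider = blast_radius(graph, "rds-replica", max_hops=2)
```

Here `direct` contains the primary and the orders API, `wider` also picks up the load balancer one hop further out, and the unrelated webhook function never shows up.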

Resource-type-aware impact classification. A modification on a Lambda is not the same as a modification on an EC2 instance, which is not the same as a modification on a GKE node pool. The classification respects that. We classified 131 resource types into three behavior buckets: onlineModify (live update, no interruption), restartModify (brief restart), and poolModify (children recreated). Default for unmapped types is Warning, not Safe. Safe has to be proven.
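
The "Safe has to be proven" rule is easy to show in a few lines. The bucket names come from the description above; the resource type strings, which type lands in which bucket, and how each bucket maps to an impact level are assumptions made for this sketch.

```python
from enum import Enum


class Impact(Enum):
    SAFE = "safe"
    WARNING = "warning"
    BREAKING = "breaking"


# Illustrative assignments only; the real mapping covers 131 types.
MODIFY_BEHAVIOR = {
    "aws_lambda_function": "onlineModify",  # live update, no interruption
    "aws_instance": "restartModify",        # brief restart
    "gke_node_pool": "poolModify",          # children recreated
}

# Assumed translation from behavior bucket to impact level.
BUCKET_IMPACT = {
    "onlineModify": Impact.SAFE,
    "restartModify": Impact.WARNING,
    "poolModify": Impact.BREAKING,
}


def classify_modify(resource_type):
    """Return the impact of modifying a resource; unknown types are never Safe."""
    bucket = MODIFY_BEHAVIOR.get(resource_type)
    if bucket is None:
        return Impact.WARNING  # unmapped type: warn until proven safe
    return BUCKET_IMPACT[bucket]


lambda_impact = classify_modify("aws_lambda_function")
unknown_impact = classify_modify("some_new_resource_type")
```

The design choice that matters is the fallback branch: a type nobody has classified yet comes back as a warning, so a gap in the mapping can never be mistaken for a green light.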

Color on the canvas. Green = safe. Amber = a connection will break. Red = destroyed or non-functional. Everything unrelated gets dimmed so your eyes can find the actual story.
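
As a small illustration of the dimming idea, assuming impact levels labeled "safe", "warning", and "breaking" (our labels for this sketch, not necessarily the product's), the color pass could look like this:

```python
# Assumed impact labels mapped to the canvas colors described above.
IMPACT_COLORS = {"safe": "green", "warning": "amber", "breaking": "red"}


def canvas_colors(all_resources, impacts):
    """Color impacted resources; everything unrelated to the change is dimmed."""
    return {
        res: IMPACT_COLORS[impacts[res]] if res in impacts else "dimmed"
        for res in all_resources
    }


# Hypothetical canvas with one risky connection and one unrelated resource.
colors = canvas_colors(
    ["rds-replica", "orders-api", "webhook-fn"],
    {"rds-replica": "breaking", "orders-api": "warning"},
)
```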

A risk score, 0 to 100. It combines severity of impact, the environment (prod, staging, or dev), how many teams own pieces of it, and whether it's on an active schedule. Not a vibe. A number that explains itself.
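
Here is one way a self-explaining score could be put together. The four inputs are the ones listed above; the point values and caps are invented for this sketch and are not the weights Blast Radius uses.

```python
# Assumed point values; chosen only so the four factors add up to 100 at most.
SEVERITY_POINTS = {"safe": 0, "warning": 25, "breaking": 50}
ENVIRONMENT_POINTS = {"dev": 5, "staging": 15, "prod": 30}


def risk_score(severity, environment, owning_teams, on_active_schedule):
    """Return (score, reasons) so the number always comes with its explanation."""
    severity_points = SEVERITY_POINTS[severity]
    environment_points = ENVIRONMENT_POINTS[environment]
    # Each team beyond the first adds coordination risk, capped at 10 points.
    team_points = min(10, 5 * max(0, owning_teams - 1))
    schedule_points = 10 if on_active_schedule else 0

    reasons = [
        f"{severity} impact: +{severity_points}",
        f"{environment} environment: +{environment_points}",
        f"{owning_teams} owning team(s): +{team_points}",
        f"active schedule: +{schedule_points}",
    ]
    total = severity_points + environment_points + team_points + schedule_points
    return min(total, 100), reasons


score_high, reasons_high = risk_score("breaking", "prod", 3, True)
score_low, reasons_low = risk_score("warning", "staging", 1, False)
```

Returning the list of reasons next to the number is the point: anyone looking at the score can see which factor pushed it up.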

So when you click "delete this idle RDS read replica," and Blast Radius lights up red on a cross-region link you forgot existed, you don't have to be the person who broke DR on a Friday. You loop in the right team, or you skip that one and confidently apply the eleven others that came back green.

If Railway had been able to look at their own architecture and ask "what happens if our GCP control plane disappears for 10 hours", and see the answer light up across Metal and AWS in red, they'd have ripped that dependency out a year ago. They are now, by the way. The postmortem commits to:

Removing GCP from the data plane's hot path
A true mesh across Metal, AWS, and GCP where any one interconnect can fail
HA database shards split across AWS and Metal, so quorum survives losing a cloud
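
The quorum point comes down to simple arithmetic, sketched below for a majority quorum. The postmortem does not spell out replica counts or placement, so the layouts here are made up purely to show the check.

```python
from collections import Counter


def survives_provider_loss(replica_providers):
    """For each provider, report whether a majority quorum survives losing it."""
    total = len(replica_providers)
    quorum = total // 2 + 1
    per_provider = Counter(replica_providers)
    return {
        provider: total - count >= quorum
        for provider, count in per_provider.items()
    }


# Hypothetical five-replica shard spread evenly enough to lose any one provider.
spread = survives_provider_loss(["aws", "aws", "metal", "metal", "gcp"])

# Hypothetical layout with most replicas in one place: losing it breaks quorum.
concentrated = survives_provider_loss(["gcp", "gcp", "gcp", "aws", "metal"])
```

In the first layout every provider can vanish and three of five replicas remain. In the second, losing the provider that holds three replicas leaves two, which is below the quorum of three.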

The lesson the rest of us pay nothing for

You can't prevent your cloud provider's billing department from having a bad day. You can't make their fraud algorithms call you first.

What you can do is make sure that when one upstream blinks, you already know, exactly, with receipts, which of your own resources will go dark with it.

The internet runs on three companies. Every once in a while, one of them hits the off switch. The question isn't whether it happens to your stack. It's whether you'll be able to point at a canvas and say, "I already knew that would fail. Here's what I did about it."

Visibility is always cheaper than recovery.