xzawed

Posted on May 23

Railway Major Outage — Turns Out Google Cloud Pulled the Plug

#railway #devops #cloudinfrastructure #platformengineering

On May 19, 2026, a routine deployment turned into a deep dive into one of the most revealing cloud outages of the year.

It Started With a Login Error

I was trying to access the Railway dashboard when this appeared:

Error authenticating with GitHub
Problem completing OAuth login

My first instinct: something's wrong with my OAuth setup. Maybe a mismatched redirect URI, an expired Client Secret, or a permissions issue on the GitHub side.

So I opened DevTools.

What DevTools Showed

The console was not encouraging:

Uncaught Error: Minified React error #418
Uncaught Error: Minified React error #423
Failed to load resource: net::ERR_NAME_NOT_RESOLVED

React errors #418 and #423 are hydration failures, and they showed up alongside a DNS resolution failure. The page's own resources weren't loading, so this wasn't a configuration issue on my end.
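A quick way to double-check that reading is to compare DNS resolution for the failing host against a control host that should always resolve. The sketch below is only an illustration: the function names, the result strings, and the idea of a "control" host are my own assumptions, not anything Railway provides.

```python
import socket


def resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(resolver(hostname, 443)) > 0
    except socket.gaierror:
        # Raised by getaddrinfo when name resolution fails.
        return False


def classify(target, control, resolver=socket.getaddrinfo):
    """Compare a failing host against a control host expected to resolve.

    The three result strings are illustrative labels, not a standard.
    """
    if resolves(target, resolver):
        return "target resolves"
    if resolves(control, resolver):
        # Only the target fails: the problem is probably upstream.
        return "target-only failure"
    # Nothing resolves: the problem is probably the local network.
    return "no DNS at all"
```

Calling `classify("railway.com", "example.com")` from the affected machine would separate "their DNS is down" from "my network is down". The `resolver` parameter exists only so the logic can be exercised without touching the network.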

There was also this in the console — ironically:

************************************
*                                  *
*        Enjoying Railway?         *
*                                  *
*       Want to ride with us?      *
*                                  *
*           Hop aboard,            *
*  the train is leaving the station!
*                                  *
*   https://railway.com/careers    *
*                                  *
************************************

A hiring pitch. Mid-outage.

Checking the Status Page

I pulled up status.railway.com. The banner said it all: Major Outage.

Affected Services

Core infrastructure (all down):

Dashboard (railway.com)
GitHub Login / Google Login
DNS
Traffic routing

GCP Regions (all down):

US East (Virginia)
US West (Oregon)
EU West (Amsterdam)
Southeast Asia (Singapore)

Railway Metal Regions (all down):

US East, US West, EU West, Southeast Asia

Build and observability (all down):

Build Machines (GCP + Metal)
Image Registry (GCP + Metal)
Metrics, Logs, Central Station

This wasn't a partial degradation. Near-total collapse.

Incident Timeline (UTC)

May 19, 22:20 — Google Cloud transitions Railway account to "restricted"

May 19, 22:29 — Railway begins investigating widespread disruption

May 19, 22:43 — Upstream cloud provider access issue identified

May 19, 23:37 — Google Cloud account block officially acknowledged

May 20, 01:34 — GCP compute recovered; networking still broken on Google's side

May 20, 01:41 — Railway Metal gradually recovering; non-enterprise builds throttled

May 20, 03:05 — Non-enterprise deploys paused; enterprise deploys unaffected

May 20, 04:58 — Deploys possible again; GCP workloads still intermittent

Total impact window: roughly 6 hours for most users.
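For anyone who wants to check the arithmetic, here is a small sketch that computes the window from the timeline entries above. Treating the 22:20 account restriction as the start and the 04:58 "deploys possible again" entry as the end is my own choice of boundaries; the variable and function names are illustrative.

```python
from datetime import datetime, timezone

# Timestamps copied from the timeline above (all UTC).
restricted = datetime(2026, 5, 19, 22, 20, tzinfo=timezone.utc)
compute_back = datetime(2026, 5, 20, 1, 34, tzinfo=timezone.utc)
deploys_back = datetime(2026, 5, 20, 4, 58, tzinfo=timezone.utc)


def hours_between(start, end):
    """Elapsed time between two datetimes, in hours."""
    return (end - start).total_seconds() / 3600


# Restriction until GCP compute recovered.
print(f"{hours_between(restricted, compute_back):.1f}")  # 3.2
# Restriction until deploys were possible again.
print(f"{hours_between(restricted, deploys_back):.1f}")  # 6.6
```

Measured this way the full window is a little over six and a half hours, which is consistent with the "roughly 6 hours" figure once you allow for services coming back at different times.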

The Real Cause: Google Cloud Suspended Railway Without Warning

The Register's headline framed it perfectly:

"Google Cloud suspended major customer Railway.com without cause, causing outage"

Railway was spending $10M+ per year on Google Cloud, a significant enterprise customer by any measure. Yet their account was placed into a restricted state with no prior notice. VMs, Cloud SQL instances, and APIs were effectively removed. Once the network route cache expired, the impact cascaded to every service on the platform.

From Railway's solution engineer, speaking to The Register:

"It looked like company resources had been deleted — they appeared not to exist."

Railway CEO Jake Cooper was more direct in the post-incident writeup:

"This incident was particularly severe because there was a single, predictable failure point: the cloud account appearing to be deleted. Every person whose business we took offline for roughly six hours has every right to be upset."

Was This the First Time?

No.

Railway had already been through a similar situation with Google Cloud in 2024, described internally as posing an "existential threat to the business." That's precisely why they had been migrating workloads to their own colocation infrastructure, Railway Metal. But the migration wasn't complete, and the dependency remained.

A comparable event occurred again in 2025. May 2026 was the third time.

Fault Analysis

Google Cloud (~80% fault):

Suspended account with no prior notice
Provided no stated reason
Pattern repeated across multiple years

Railway (~20% fault):

Incomplete migration away from GCP despite two prior incidents
Single failure point remained in the architecture
Non-enterprise workloads deprioritized during recovery

Railway's CEO acknowledged this directly:

"We are responsible for your uptime."

Railway By the Numbers (2026)

Founded: 2020, San Francisco, USA
CEO: Jake Cooper (ex-Wolfram Alpha, Bloomberg, Uber)
Total funding raised: $120M (Series B $100M closed January 2026)
Registered developers: 2,000,000+
Monthly new developers: ~200,000
Monthly deployments: 10,000,000+
Annual GCP spend: $10M+
Global PaaS market share: 0.8% (10th)
Team size: ~30 people

Railway is growing fast. The product is genuinely good. The developer experience is arguably the best in the category. But the infrastructure reliability story has a recurring problem.

What This Reveals About the PaaS Ecosystem

1. Big Cloud dependency is a structural risk at any scale.

AWS, Azure, and GCP collectively control roughly 63% of the global cloud market. Every PaaS platform that doesn't own its own infrastructure is, to some degree, at the mercy of these providers. Railway Metal was supposed to fix this. It didn't move fast enough.

2. Automated systems don't care about customer tier.

$10M/year doesn't buy you immunity from an automated account suspension trigger. When Google's systems flag an account, the playbook runs — regardless of revenue, regardless of the downstream impact on 2 million developers.

3. Non-enterprise users are second-class customers during recovery.

The status updates made this explicit:

"Non-enterprise deploys remain paused; enterprise deploys are unaffected."

If you're on a Hobby or Pro plan, you're lower priority during a crisis. Worth knowing before you put a production service on any PaaS.

Alternative Platforms Worth Considering

Fly.io

35+ global regions, edge deployment, fine-grained infrastructure control
Best for: performance-critical workloads, global apps

Render

Stable, predictable pricing, solid managed PostgreSQL
Best for: teams migrating from Railway quickly

DigitalOcean App Platform

Mature, transparent monthly pricing, clear path to Droplets/Kubernetes
Best for: long-term production workloads

All three have lower incident frequency than Railway over the past 18 months.

Takeaways

Check incident history before committing to a platform.
Growth metrics and DX scores don't tell you how a platform behaves when things go wrong. The Railway status page history tells a clearer story than any benchmark.

Single cloud dependency is a risk regardless of mitigation promises.
If the platform you rely on is still routing significant workloads through one cloud provider, your uptime depends partly on the platform and partly on its cloud provider.

Big Cloud is structurally oligopolistic.
This isn't a Railway-specific issue. It's the nature of building on top of infrastructure you don't control. Railway is taking steps to own their stack. Until that's complete, the risk remains.
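One way to act on the single-dependency point from the client side is to keep a second deployment on independent infrastructure and route to whichever one answers a health check. The sketch below is a minimal illustration under that assumption. The function names and the idea of a `/health`-style URL are hypothetical, and a real setup would more likely do this at the DNS or load-balancer layer.

```python
import urllib.request
from typing import Callable, Iterable, Optional


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are both subclasses of OSError.
        return False


def first_healthy(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool] = http_probe,
) -> Optional[str]:
    """Return the first endpoint that passes the health check, else None.

    Endpoints are tried in order, so list the primary deployment first.
    """
    for url in endpoints:
        try:
            if is_healthy(url):
                return url
        except Exception:
            # A misbehaving probe should not stop the fallback search.
            continue
    return None
```

The order of `endpoints` encodes priority: the primary platform first, then a copy hosted somewhere that does not share its cloud account. This only helps if the fallback really is independent, which is exactly the property Railway's incomplete migration lacked.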

Final Note

This post isn't a hit piece on Railway. The product solves a real problem well, and the team's post-incident transparency was better than most. But the May 19 outage is a useful case study in the hidden dependencies behind any "it just works" deployment platform.

Build accordingly.

Written based on direct experience during the Railway Major Outage, May 19–20, 2026.
Sources: Railway status page, The Register, Railway CEO post-incident statement.
