<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: xzawed</title>
    <description>The latest articles on DEV Community by xzawed (@xzawed).</description>
    <link>https://dev.to/xzawed</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3896824%2F58a4e9be-19f4-4cae-ae5a-23efa15ec65a.png</url>
      <title>DEV Community: xzawed</title>
      <link>https://dev.to/xzawed</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xzawed"/>
    <language>en</language>
    <item>
      <title>I built a tool that runs static analysis + Claude AI review on every GitHub Push/PR — SCAManager</title>
      <dc:creator>xzawed</dc:creator>
      <pubDate>Sat, 25 Apr 2026 13:44:00 +0000</pubDate>
      <link>https://dev.to/xzawed/i-built-a-tool-that-runs-static-analysis-claude-ai-review-on-every-github-pushpr-scamanager-252m</link>
      <guid>https://dev.to/xzawed/i-built-a-tool-that-runs-static-analysis-claude-ai-review-on-every-github-pushpr-scamanager-252m</guid>
      <description>&lt;p&gt;It started as a small annoyance&lt;br&gt;
PR reviews are always a chore. On a small team — or a side project I run alone — the "someone has to look at this" person is always me. And if you're pushing straight to main, code review effectively disappears.&lt;br&gt;
I started by stacking pylint and flake8 on top of GitHub Actions. But those don't answer the questions that actually matter: did this change do what I meant it to? Does the commit message actually describe what changed? Static analysis catches syntax and style; it can't read intent.&lt;br&gt;
So I asked Claude to review the same diffs, fused both signals together, scored them out of 100, and pushed the result to Telegram. That became SCAManager.&lt;br&gt;
GitHub: &lt;a href="https://github.com/xzawed/SCAManager" rel="noopener noreferrer"&gt;https://github.com/xzawed/SCAManager&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What it does&lt;br&gt;
When a GitHub Webhook fires for a Push or PR event, the following runs in parallel:&lt;/p&gt;

&lt;p&gt;Static analysis — pylint, flake8, bandit&lt;br&gt;
AI code review — Claude Haiku 4.5&lt;br&gt;
Commit message evaluation — Claude AI&lt;/p&gt;

&lt;p&gt;Results map to a 100-point score and an A–F grade, then ship to whichever of the nine channels you've configured: Telegram, GitHub PR Comment, GitHub Commit Comment, GitHub Issue, Discord, Slack, Email, Generic Webhook, n8n.&lt;br&gt;
For PRs, the score drives the gate automatically:&lt;/p&gt;

&lt;p&gt;Auto mode — Above threshold → GitHub APPROVE. Below → REQUEST_CHANGES.&lt;br&gt;
Semi-auto mode — Inline buttons in Telegram for manual approval.&lt;br&gt;
Auto-merge — Above a separate threshold → squash merge.&lt;/p&gt;

&lt;p&gt;The scoring system — why these weights&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Item&lt;/th&gt;&lt;th&gt;Points&lt;/th&gt;&lt;th&gt;Evaluator&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Code quality&lt;/td&gt;&lt;td&gt;25&lt;/td&gt;&lt;td&gt;pylint + flake8&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Security&lt;/td&gt;&lt;td&gt;20&lt;/td&gt;&lt;td&gt;bandit&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Commit message&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;Claude AI&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Implementation direction&lt;/td&gt;&lt;td&gt;25&lt;/td&gt;&lt;td&gt;Claude AI&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Test coverage&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;Claude AI&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;100&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Things machines see well go to machines (pylint, bandit). Things that need human judgment go to AI. AI evaluations come back on a 0–10 or 0–20 scale, then get re-weighted into the final score.&lt;br&gt;
If ANTHROPIC_API_KEY isn't set, the AI items default to a neutral middle score, and static analysis alone can still reach at most 89 points (a B grade). The tool isn't useless without API spend.&lt;/p&gt;
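&lt;p&gt;The re-weighting can be sketched like this. The weights come from the scoring table above; the uniform 0–10 sub-score scale, the neutral default, and the grade cutoffs are my assumptions, not necessarily the repo's:&lt;/p&gt;

```python
# Weights from the scoring table; the 0-10 sub-score scale, neutral default,
# and grade cutoffs below are illustrative assumptions.
WEIGHTS = {
    "code_quality": 25,    # pylint + flake8
    "security": 20,        # bandit
    "commit_message": 15,  # Claude AI
    "implementation": 25,  # Claude AI
    "test_coverage": 15,   # Claude AI
}

def to_grade(score: float) -> str:
    """Conventional A-F cutoffs (assumed, not confirmed from the repo)."""
    for cutoff, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return grade
    return "F"

def calculate_score(raw: dict) -> tuple:
    """raw maps an item name to a 0-10 sub-score; missing items (e.g. no
    API key) fall back to a neutral 5/10 before re-weighting into 0-100."""
    total = sum(WEIGHTS[k] * raw.get(k, 5.0) / 10.0 for k in WEIGHTS)
    return round(total, 1), to_grade(total)
```

&lt;p&gt;With perfect static-analysis scores and the AI items at the neutral default, this particular sketch tops out lower than the 89 points quoted above, so the real neutral defaults are presumably more generous.&lt;/p&gt;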

&lt;p&gt;Architecture — the parts that were interesting to build&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;asyncio.gather() for parallelism
Running static analysis and AI review serially makes per-PR analysis time miserable. Wrapping them in asyncio.gather() collapses total wall-clock time to that of the slowest task.
I use asyncio.gather(return_exceptions=True) for the nine notification channels too — but there the goal is isolation, not speed. If Telegram is down, that shouldn't block Slack.&lt;/li&gt;
&lt;li&gt;Idempotency — same SHA, no double work
GitHub Webhooks get retransmitted (response timeouts, retries, etc.). Running the same commit SHA twice costs money and produces no new information, so I dedupe by SHA at the DB layer.
&lt;pre&gt;&lt;code&gt;GitHub Push/PR
└─ POST /webhooks/github  (HMAC-SHA256 verification)
   └─ BackgroundTask: run_analysis_pipeline()
        ├─ Repo register · SHA dedup (idempotency)
        ├─ asyncio.gather() ── parallel
        │    ├─ analyze_file() × N  (pylint · flake8 · bandit)
        │    └─ review_code()       (Claude AI)
        ├─ calculate_score() → grade
        ├─ run_gate_check()  [PR only]
        └─ asyncio.gather(return_exceptions=True) → notification channels
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;Two ways to use the AI
Same review, two call paths:&lt;/li&gt;
&lt;/ol&gt;
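&lt;p&gt;The two gather() calls described in item 1 can be sketched like this, with trivial coroutines standing in for the real analyzers and channels (all names here are placeholders):&lt;/p&gt;

```python
import asyncio

async def analyze_file(path):
    await asyncio.sleep(0.01)          # stand-in for pylint/flake8/bandit
    return {"file": path, "issues": 0}

async def review_code(diff):
    await asyncio.sleep(0.01)          # stand-in for the Claude API call
    return {"verdict": "ok"}

async def notify(channel, report):
    if channel == "telegram":
        raise ConnectionError("down")  # simulate one channel being unavailable
    return f"sent to {channel}"

async def pipeline(files, diff):
    # Speed: static analysis and AI review run concurrently; total
    # wall-clock time is roughly that of the slowest task.
    results = await asyncio.gather(
        *(analyze_file(f) for f in files),
        review_code(diff),
    )
    # Isolation: with return_exceptions=True a failure comes back as a
    # value instead of cancelling the batch, so every channel is tried.
    sent = await asyncio.gather(
        *(notify(c, results) for c in ("telegram", "slack", "discord")),
        return_exceptions=True,
    )
    return results, sent
```

&lt;p&gt;The first gather() fails fast as a unit; the second deliberately swallows per-channel failures, which matches the two different goals.&lt;/p&gt;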

&lt;p&gt;Server mode — Anthropic API. Needs ANTHROPIC_API_KEY. Costs money.&lt;br&gt;
Local hook mode — Claude Code CLI (claude -p). Runs locally, no API key needed.&lt;/p&gt;

&lt;p&gt;Local hook mode runs as a pre-push git hook. Output goes to terminal and to the dashboard. Environments without the CLI (Codespaces, mobile) silently skip the hook — exit 0 always, never blocks the push.&lt;/p&gt;
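&lt;p&gt;The never-block-the-push behavior might look like this as a Python hook script. This is a sketch of the idea only: the CLI probe, the diff command, and the function name are my assumptions, and the repo's actual hook may be written differently:&lt;/p&gt;

```python
"""Illustrative pre-push hook: review locally, but never block the push."""
import shutil
import subprocess

def main(cli: str = "claude") -> int:
    # No CLI available (Codespaces, mobile, etc.): skip silently.
    if shutil.which(cli) is None:
        return 0
    # Illustration only: diff the last commit. A real hook would read the
    # outgoing refs from stdin and diff the full range being pushed.
    diff = subprocess.run(
        ["git", "diff", "HEAD~1"],
        capture_output=True, text=True,
    ).stdout
    subprocess.run([cli, "-p", "Review this diff:\n" + diff])
    return 0  # advisory only; the installed hook calls sys.exit(main())
```

&lt;p&gt;Returning 0 unconditionally is the whole contract: the review is advisory, and any environment where it can't run degrades to a no-op.&lt;/p&gt;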

&lt;ol start="4"&gt;
&lt;li&gt;DB Failover
I built a FailoverSessionFactory that switches over to a fallback PostgreSQL when the primary dies. /health reports which DB is currently active.
Honestly, this is probably over-engineered. Whether a small side project actually needs failover is a separate question — building it was largely a learning exercise.&lt;/li&gt;
&lt;/ol&gt;
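&lt;p&gt;A stripped-down sketch of the failover idea, with plain callables standing in for SQLAlchemy session factories. The class shape is illustrative; the real implementation wraps engines and connection errors differently:&lt;/p&gt;

```python
class FailoverSessionFactory:
    """Try the primary first; on connection failure, switch to the fallback
    and stay there. Illustrative sketch, not the project's actual class."""

    def __init__(self, primary, fallback):
        self._factories = {"primary": primary, "fallback": fallback}
        self.active = "primary"  # a /health endpoint could report this field

    def __call__(self):
        try:
            return self._factories[self.active]()
        except ConnectionError:
            if self.active == "primary":
                self.active = "fallback"  # sticky switch-over
                return self._factories[self.active]()
            raise  # both databases down: nothing left to fall back to
```

&lt;p&gt;The switch is sticky by design: once the primary has failed, later calls go straight to the fallback instead of re-probing a dead database on every request.&lt;/p&gt;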

&lt;p&gt;Limits and trade-offs&lt;br&gt;
This tool isn't going to fit every team. Being honest about it:&lt;/p&gt;

&lt;p&gt;Python-only — Static analysis is pylint/flake8/bandit. For non-Python repos, only the AI review piece gives you value.&lt;br&gt;
AI score consistency — LLM output isn't 100% deterministic. The score is for spotting trends, not a hard, trustworthy number.&lt;br&gt;
API cost — Teams shipping big PRs frequently can rack up Claude API spend fast. File filters and thresholds give you some control, but it's a real cost line.&lt;br&gt;
Auto-merge risk — Score-driven squash merge is convenient and dangerous. Validate your threshold settings before turning it on. Start in semi-auto mode.&lt;/p&gt;

&lt;p&gt;If you want to try it&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/xzawed/SCAManager" rel="noopener noreferrer"&gt;https://github.com/xzawed/SCAManager&lt;/a&gt;&lt;br&gt;
License: MIT&lt;br&gt;
Required: Python 3.13 · PostgreSQL · GitHub OAuth App&lt;br&gt;
Optional: ANTHROPIC_API_KEY · Telegram Bot Token · SMTP&lt;/p&gt;

&lt;p&gt;Easiest deploy: Railway with the PostgreSQL plugin and your env vars filled in. For on-prem, uvicorn + nginx + systemd works fine.&lt;br&gt;
Feedback, issues, and "wait, is this actually how it should behave?" reports are all welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
