- A Google Cloud outage caused by an automated account suspension took Railway’s entire platform offline for nearly 8 hours on May 19.
- The Google Cloud outage cascaded beyond GCP itself, knocking out Railway workloads running on AWS and its own Railway Metal infrastructure.
- Railway’s control plane architecture meant that when GCP went down, even healthy servers on other providers became completely unreachable.
- GitHub rate-limiting Railway’s OAuth integrations added a second layer of disruption during the recovery, blocking logins and builds simultaneously.
A Single Wrong Call Took Down an Entire Platform
A Google Cloud outage — triggered not by a hardware failure or network fault, but by an erroneous automated account suspension — took Railway completely offline for nearly eight hours last week. The incident, which ran from 22:20 UTC on May 19 to roughly 06:14 UTC on May 20, 2026, is a sharp reminder of how much operational risk modern infrastructure companies are quietly absorbing by depending on a single cloud provider for critical control-plane services.
Railway confirmed in its incident report that the Google Cloud outage occurred when Google suspended its production account as part of a broader automated action affecting many accounts simultaneously. There was no prior warning. No outreach. Just services going dark.
Within minutes, users were hitting 503 errors on the dashboard and API. Login was impossible. The familiar “no healthy upstream” and “unconditional drop overload” messages started flooding in. Railway’s on-call team was paged at 22:10 UTC — ten minutes before the full suspension hit — when automated monitoring caught the first API health check failures.
Why the Google Cloud Outage Spread Far Beyond Google Cloud
Here’s where this gets genuinely interesting, and where Railway deserves credit for being unusually transparent: the Google Cloud outage didn’t stay contained to GCP. Railway runs workloads on its own Railway Metal bare-metal infrastructure and on AWS as a burst-cloud environment. Both of those stayed up. The actual compute kept running. And yet users on those environments also went dark.
Why? Because Railway’s edge proxies depend on a network control plane that lives inside Google Cloud. That control plane populates the routing tables that tell the edge where to send traffic. While the proxies’ cached routes remained valid, everything looked fine: workloads on Metal and AWS kept serving requests. But once those cached routes began to expire, starting around 22:35 UTC, the edge could no longer resolve where anything lived. Healthy servers became invisible, and the whole network started returning 404s.
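To make that failure mode concrete, here is a minimal Python sketch of a route cache sitting in front of a control plane. It is an illustration under stated assumptions, not Railway's proxy code: `RouteCache`, `ControlPlaneDown`, the `serve_stale` flag, the backend address and the 900-second TTL are all hypothetical. The TTL was chosen only because it matches the roughly 15 minutes between the 22:20 UTC suspension and the first route expiries around 22:35 UTC.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ControlPlaneDown(Exception):
    """Raised by the hypothetical fetch callback when the control plane is unreachable."""


class RouteCache:
    """Toy edge route cache mapping hostname -> backend, with a per-entry TTL."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        serve_stale: bool = False,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._serve_stale = serve_stale
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # host -> (backend, expires_at)

    def resolve(self, host: str) -> Optional[str]:
        """Return a backend for host, or None (the edge would answer 404)."""
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and now < entry[1]:
            return entry[0]  # fresh hit: the control plane is not consulted
        try:
            backend = self._fetch(host)
        except ControlPlaneDown:
            if self._serve_stale and entry is not None:
                return entry[0]  # expired, but still the last route we knew
            return None
        self._entries[host] = (backend, now + self._ttl)
        return backend


def demo() -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Warm two caches, take the control plane down, then move past the TTL."""
    now = [0.0]  # fake clock, in seconds
    control_plane_up = [True]

    def fetch(host: str) -> str:
        if not control_plane_up[0]:
            raise ControlPlaneDown(host)
        return "metal-node-1:8080"

    def clock() -> float:
        return now[0]

    strict = RouteCache(fetch, ttl_seconds=900, clock=clock)
    lenient = RouteCache(fetch, ttl_seconds=900, serve_stale=True, clock=clock)
    strict.resolve("app.example")
    lenient.resolve("app.example")

    control_plane_up[0] = False
    now[0] = 600.0  # inside the TTL: cached route still answers
    during = strict.resolve("app.example")
    now[0] = 901.0  # past the TTL: strict cache has nothing left
    return during, strict.resolve("app.example"), lenient.resolve("app.example")
```

In this toy model the strict cache keeps answering while its entries are fresh and returns nothing once they expire, even though the backend itself never went down. The `serve_stale` variant keeps routing to the last known backend, at the cost of possibly pointing at a server that has since moved.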
It’s a textbook example of a cascading failure — a concept Google’s own Site Reliability Engineering literature covers extensively. A dependency that feels peripheral turns out to be load-bearing. Railway acknowledges this directly: “We take full responsibility for the architectural decisions that allowed a single upstream provider action to cascade into a platform-wide outage.”
That kind of accountability is rarer than it should be in incident reports, which too often read like exercises in blame-diffusion. Railway’s post-mortem doesn’t do that.
The Recovery: Slower Than You’d Hope
Account access was technically restored by 22:29 UTC — just nine minutes after the P0 ticket was filed with Google and Railway’s GCP account manager was engaged directly. Fast, relatively speaking. But restoring access to an account and restoring a working production environment are very different things, and that gap is where the next several hours were spent.
Persistent disks are where databases and stateful services live. The first one came back online at 23:09 UTC, and all of them reached a ready state by 23:54 UTC. But the network was still down. Compute instances didn’t start recovering until 01:30 UTC on May 20. Edge traffic resumed at 01:38 UTC. Orchestration and build infrastructure followed at 01:57 UTC.
At that point, a new problem emerged: a massive backlog of queued deployments all trying to execute at once. Railway deliberately paused deploys to avoid overwhelming the recovering systems — a smart call, though it meant users were staring at stuck pipelines for longer than they wanted.
Then, at 02:47 UTC, GitHub started rate-limiting Railway’s OAuth and webhook integrations. The cache-clearing caused by the Google Cloud outage had triggered a spike in calls to GitHub’s APIs, and GitHub’s automated systems responded accordingly. This blocked logins and builds for a second time during the recovery window — a painful compounding failure that had nothing to do with Google Cloud directly, but was a direct consequence of the original disruption.
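A cold cache turning into a burst of upstream API calls is a familiar pattern, and one common way to blunt it is to rate-limit the refill on the client side. The Python sketch below shows a token bucket doing that. It is a generic illustration, not a description of what Railway or GitHub run: `TokenBucket`, `refill_cold_cache` and the numbers used with them are hypothetical.

```python
import time
from typing import Callable, Iterable, List, Tuple


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, capped at the bucket size.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def refill_cold_cache(
    keys: Iterable[str], bucket: TokenBucket
) -> Tuple[List[str], List[str]]:
    """Split keys into those refreshed now and those deferred to a later pass."""
    sent: List[str] = []
    deferred: List[str] = []
    for key in keys:
        if bucket.try_acquire():
            sent.append(key)  # a real client would call the upstream API here
        else:
            deferred.append(key)
    return sent, deferred
```

With a bucket of capacity 5 and a rate of 10 per second, 20 simultaneous cache misses become 5 immediate calls and 15 deferred ones, instead of 20 calls landing on the upstream API at once.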
The dashboard came back at 02:55 UTC. Deployments started processing again across all tiers by 03:59 UTC. The incident moved to monitoring at 06:14 UTC and was fully resolved by 07:58 UTC — nearly ten hours after the first alert fired.
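The "nearly eight hours" and "nearly ten hours" figures measure different intervals, and the arithmetic is easy to check. The short Python snippet below uses only timestamps quoted in this article. The helper names `utc` and `fmt` are illustrative.

```python
from datetime import datetime, timedelta, timezone


def utc(day: int, hhmm: str) -> datetime:
    """Build a UTC timestamp in May 2026 from a day number and 'HH:MM'."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2026, 5, day, hour, minute, tzinfo=timezone.utc)


def fmt(delta: timedelta) -> str:
    """Format a duration as e.g. '7h54m'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


first_alert = utc(19, "22:10")  # on-call paged
suspension = utc(19, "22:20")   # account suspended
monitoring = utc(20, "06:14")   # incident moved to monitoring
resolved = utc(20, "07:58")     # incident fully resolved

print("suspension -> monitoring:", fmt(monitoring - suspension))  # 7h54m
print("first alert -> resolved: ", fmt(resolved - first_alert))   # 9h48m
```

So the eight-hour figure runs from the suspension to the move into monitoring, while the ten-hour figure runs from the first page to full resolution.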
As a footnote, Terms of Service acceptance records were wiped by the incident, prompting users to re-accept on their next login. A minor annoyance in context, but not a great look for a platform trying to reassure customers after a major outage.
What Railway Says It’s Fixing
Railway’s incident report stops short of naming the specific architectural changes it plans to make — which is understandable given these things take time — but the direction is clear. The core problem was that a single upstream provider had too much structural influence over Railway’s entire network layer. Even a multi-cloud setup with AWS and proprietary metal wasn’t enough, because the routing intelligence was still centralised in GCP. The Google Cloud outage exposed exactly how fragile that centralisation makes an otherwise distributed platform.
The fix, almost certainly, involves decentralising or replicating the network control plane so that no single provider suspension can poison the routing tables. That probably means running control plane replicas across providers, extending cache TTLs, and building automated failover that doesn’t require human intervention to kick in.
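As a rough illustration of what that could look like from the edge's point of view, the Python sketch below tries control plane replicas in order and falls back to a last-known-good route table if none respond. Everything in it is hypothetical, including `fetch_routes`, the replica names and the two example fetchers. Railway has not published its design, so this is one possible shape rather than the planned fix.

```python
from typing import Callable, Dict, List, Optional, Tuple

RouteTable = Dict[str, str]          # hostname -> backend address
Fetcher = Callable[[], RouteTable]   # pulls the full table from one replica


def fetch_routes(
    replicas: List[Tuple[str, Fetcher]],
    last_known_good: Optional[RouteTable] = None,
) -> Tuple[str, RouteTable]:
    """Return (source_name, routes) from the first replica that answers.

    If every replica fails, fall back to the last-known-good table so the
    edge keeps routing instead of dropping traffic.
    """
    errors: List[str] = []
    for name, fetcher in replicas:
        try:
            return name, fetcher()
        except Exception as exc:  # any failure moves on to the next replica
            errors.append(f"{name}: {exc}")
    if last_known_good is not None:
        return "last-known-good", last_known_good
    raise RuntimeError("no control plane replica reachable: " + "; ".join(errors))


# Hypothetical replicas for illustration only.
def gcp_replica() -> RouteTable:
    raise ConnectionError("account suspended")


def metal_replica() -> RouteTable:
    return {"app.example": "metal-node-1:8080"}


source, routes = fetch_routes([("gcp", gcp_replica), ("metal", metal_replica)])
print(source, routes)
```

The ordering is the design choice that matters here: failover happens inside the fetch path, with no operator in the loop, and the last-known-good table only comes into play when every provider is unreachable at once.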
Whether Railway also pushes Google Cloud for answers about the erroneous suspension itself is a separate question. The incident report notes it was an automated action affecting many accounts simultaneously, which suggests this wasn’t a Railway-specific issue: other companies may have been hit by the same wave of suspensions without publishing incident reports.
The Bigger Picture for Cloud-Dependent Startups
Railway isn’t alone in this exposure. Plenty of developer-focused platforms (Render, Fly.io, Heroku in its heyday) run critical infrastructure on top of hyperscalers while simultaneously trying to offer an abstraction layer above them. The business logic makes sense: AWS, GCP, and Azure have the global reach and reliability guarantees that startups can’t build themselves. But it creates a dependency structure where your customers are one automated policy decision away from an outage you can’t control or predict.
This particular Google Cloud outage was resolved in hours. A longer suspension, or one that stretched through the European working day instead of ending in the early morning UTC, could have been far more damaging, both operationally and reputationally. For Railway’s customers, many of whom are running production workloads for their own businesses, eight hours of downtime isn’t just an inconvenience. It’s a genuine business continuity problem.
The incident is also a useful stress test of Railway’s transparency practices. Publishing a detailed timeline, admitting architectural fault, and committing to structural changes is the right approach. It won’t prevent every customer from churning, but it builds the kind of trust that sustains a platform long-term. In a market where developers have real alternatives, how you handle failures often matters as much as how often they happen.
Source: https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage

