TechLogStack

Originally published at techlogstack.com

Google's Own Cleanup Job Crashed Cloud Services Across 4 Continents — and Then Made Recovery Worse

  • June 12, 2025 — 7+ hour outage; North America, Europe, Far East, Africa simultaneously
  • Root cause: null pointer exception in Service Control from a May 29 code change — dormant for 14 days
  • No feature flag, no error handling on the new code path
  • 50+ Google Cloud services affected: IAM, Compute Engine, Cloud Storage, BigQuery, Vertex AI, Google Workspace
  • Third-order cascade: Google → Cloudflare → Discord/Twitch; Discord users had no idea why they were down
  • Herd effect during recovery overwhelmed Spanner in us-central1, extending the outage by 2+ hours

On May 29, 2025, a Google engineer deployed new quota-checking code to Service Control, the system that authorises every API request across Google Cloud. The code had a bug: it couldn't handle a null value. The bug stayed invisible during deployment because only specific policy data, which hadn't appeared yet, could trigger it. Two weeks later, an automated system pushed a routine policy update containing blank fields. The policy data replicated globally within seconds. Every Service Control binary in every region hit the null pointer, crashed, and could not restart cleanly. Spotify went down. Discord went down. Snapchat went down. Google's own status page went down. And when engineers deployed the fix, the restart surge overwhelmed the database behind Service Control in us-central1, prolonging the outage there by more than two hours.


The Story

On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.

— Google Cloud, Official Incident Report, June 14, 2025

Service Control is not a product you've heard of. It doesn't have a marketing page or a conference talk. It exists in the infrastructure layer beneath everything else — the system that authorises every API request across Google Cloud and Google Workspace before that request is allowed to proceed. If you call the Cloud Storage API, Service Control checks your quota. If you authenticate with Google IAM, Service Control validates your policy. If your app on Google Cloud makes any call to any Google service, Service Control is in the critical path. It is, in the most literal sense, the gatekeeper of the entire platform.

When Service Control crashed on June 12, it didn't just take down one service. It took down the authorisation layer for every service. API calls returned 503 errors not because the underlying services had failed, but because the gatekeeper wasn't there to let them through. Compute Engine instances were running. Cloud Storage buckets were intact. BigQuery jobs were ready to execute. None of it mattered — because without Service Control, nothing could be authorised, and nothing unauthorised can proceed in a correctly secured cloud platform.


What Service Control Actually Does

Google Cloud's Service Control performs three functions on every API request: authentication (is this requester who they claim to be?), authorisation (are they allowed to perform this operation?), and quota enforcement (have they exceeded their usage limits?). It processes these checks at massive scale across every region — billions of API calls per day — using policy metadata stored in and synchronised across Spanner, Google's globally distributed database. The May 29 code change was adding more sophisticated quota checking logic to this pipeline. The change worked correctly in every scenario that was tested. The scenario that wasn't tested was the one that appeared on June 12.
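
As a rough illustration of the three checks described above, the sketch below runs authentication, authorisation, and quota enforcement in sequence for a single request. Every name in it (`Request`, `check_request`, the in-memory `POLICIES` and `USAGE` tables) is hypothetical; this is a toy model for orientation, not Google's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical in-memory stand-ins for policy metadata and usage counters.
POLICIES = {"alice": {"allowed_ops": {"storage.read"}, "quota_limit": 2}}
USAGE = {"alice": 0}


@dataclass
class Request:
    caller: Optional[str]  # None models a request with no verified identity
    operation: str


def check_request(req: Request) -> Tuple[int, str]:
    """Return an (HTTP-style status, reason) pair for one API request."""
    # 1. Authentication: is the caller a known identity?
    if req.caller is None or req.caller not in POLICIES:
        return 401, "unauthenticated"
    policy = POLICIES[req.caller]
    # 2. Authorisation: may this caller perform this operation?
    if req.operation not in policy["allowed_ops"]:
        return 403, "permission denied"
    # 3. Quota enforcement: has the caller used up its allowance?
    if USAGE[req.caller] >= policy["quota_limit"]:
        return 429, "quota exceeded"
    USAGE[req.caller] += 1
    return 200, "ok"
```

Because all three checks sit in one call path, an unhandled exception in the quota step would fail the whole request, even for callers whose identity and permissions are perfectly valid.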

Problem

May 29: Code Deployed — Bug Present, But Invisible

Google engineers deployed new quota policy checking code to Service Control. The deployment went through the standard region-by-region rollout and passed all checks. But the new code path had two critical gaps: no error handling for null values, and no feature flag to disable it if something went wrong. The bug was invisible during rollout because the problematic code path could only be triggered by blank fields in the policy metadata. That input hadn't appeared during rollout. The binary was now running in every region with a loaded trap, waiting for the right trigger.


Cause

June 12, 10:45 AM PDT: The Policy Update That Pulled the Trigger

An automated system inserted a routine policy change into the regional Spanner tables that Service Control uses for policy metadata. The policy update contained unintended blank fields. Because quota management is global, Spanner's replication engine distributed this metadata worldwide within seconds. Every Service Control binary in every region hit the new code path, encountered the null values, and threw a null pointer exception. Without error handling, the exception crashed the binary. Service Control was dead globally.
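
The failure mode in this step can be sketched in a few lines. The record shape (`id`, `quota`, `limit`) and the function names below are assumptions made for illustration, and Python's `TypeError` on `None` stands in for the null pointer exception. The point is only the contrast between dereferencing a field blindly and validating it first.

```python
class PolicyError(ValueError):
    """Raised when a policy record is missing required fields."""


def quota_limit_unsafe(policy: dict) -> int:
    # Mirrors the failure mode: assumes the nested field is always present.
    # A blank "quota" field (None) raises TypeError here, Python's analogue
    # of a null pointer dereference.
    return policy["quota"]["limit"]


def quota_limit_safe(policy: dict) -> int:
    # Validate before dereferencing, and fail with a specific, catchable error.
    quota = policy.get("quota")
    if not isinstance(quota, dict) or quota.get("limit") is None:
        raise PolicyError(f"policy {policy.get('id')!r} has a blank quota field")
    return int(quota["limit"])


def load_policies(records: list) -> dict:
    """Keep valid policies; skip and report ones with blank quota fields."""
    loaded, rejected = {}, []
    for record in records:
        try:
            loaded[record["id"]] = quota_limit_safe(record)
        except PolicyError as err:
            # One bad record is reported and skipped, not fatal to the process.
            rejected.append(str(err))
    return {"loaded": loaded, "rejected": rejected}
```

With the safe variant, a single malformed record ends up in `rejected` while every other policy keeps loading, so the process stays up and the bad data is visible in logs or metrics.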


Solution

The SRE Response: Diagnosis in 10 Minutes, Red Button in 40

Google's SRE team began triaging within two minutes of the first alert. They identified the root cause — the null pointer exception in the new quota checking code path — within 10 minutes. Engineers deployed a 'red button' kill switch within 40 minutes to disable the problematic serving path. Most regions began recovering within two hours.


Result

The Herd Effect: When Recovery Made Things Worse

As Service Control instances restarted in us-central1 after the red button was deployed, they all simultaneously reached for the regional Spanner database to load their policy metadata. Hundreds of instances, all restarting at the same moment, all hitting Spanner at once, with no randomisation in their startup sequence. Spanner was overwhelmed by the simultaneous burst. Service Control couldn't load its policies, couldn't restart properly, kept trying, kept hitting Spanner. The recovery created a herd effect that prolonged the outage in us-central1 by more than two hours beyond when other regions had stabilised. Full resolution wasn't complete until 18:18 PDT — more than seven hours after the incident began.


The Fix

Google's Response: Five Commitments After the Outage

Google's five-category post-incident remediation plan (from the official June 14, 2025 incident report):

| Failure Mode | What Happened | Google's Fix |
| --- | --- | --- |
| Missing error handling | Null pointer exception crashed the binary when blank fields appeared | Mandatory null-safe code patterns with static analysis to catch null pointer vulnerabilities before deployment |
| No feature flag | New code path couldn't be disabled without full binary redeployment, adding 30+ min to response | Feature flag protection required for all new Service Control code paths |
| Herd effect during recovery | Hundreds of instances restarting simultaneously overwhelmed Spanner | Randomised exponential backoff on Service Control startup |
| Status page availability | Cloud Service Health dashboard went down during the outage | Decouple status infrastructure from the services it monitors |
| Service Control architecture | Monolithic binary: a crash in quota logic crashes all authorisation | Modularise Service Control: isolate quota checking from authentication |

  • 10 min — time for Google's SRE team to identify the root cause from the first alert at 10:49 AM PDT
  • 40 min — time to deploy the red button kill switch that disabled the problematic code path
  • 7+ hrs — total outage duration; most regions recovered in ~2 hours, herd effect in us-central1 extended full resolution
  • 50+ — Google Cloud services affected, including all core infrastructure APIs, all Google Workspace products, and all AI/ML services

The Feature Flag That Would Have Saved Seven Hours

The most consequential missing safeguard was a feature flag on the new quota checking code path. A flag would have changed the timeline dramatically: when null pointer exceptions began firing at 10:49 AM PDT, engineers could have switched the new code path off across all regions within seconds of identifying it. Without a flag, the only option was a red-button kill switch requiring a new binary deployment, which took 40 minutes. That is 40 minutes of global outage versus the seconds it takes to toggle a flag. Google's incident report goes further and says a flag would have surfaced the bug before production: "If this had been flag protected, the issue would have been caught in staging."

The dormant trap pattern that caused this outage is worth naming explicitly. Google's staged, region-by-region rollout is exactly the right practice for catching bugs introduced by new deployments. It worked correctly for 14 days — no failures appeared during the May 29 rollout because the failure condition required specific policy data (blank fields) that hadn't yet been inserted. Staged rollouts are structurally unable to catch dormant traps — bugs that only activate when a specific trigger arrives weeks later from an unrelated automated system. The only defences against dormant traps are error handling (so the crash doesn't happen when the trigger arrives) and feature flags (so the code path can be disabled immediately when the trigger produces unexpected behaviour). The May 29 change had neither.
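
A minimal sketch of the two defences named above, used together. The flag store, the flag name `new_quota_policy_checks`, and both quota functions are hypothetical; a production flag system would read from remotely managed configuration, not from an in-process dictionary.

```python
import threading


class FlagStore:
    """Hypothetical in-process flag store; a real one would watch remote config."""

    def __init__(self) -> None:
        self._flags = {}
        self._lock = threading.Lock()

    def set(self, name: str, enabled: bool) -> None:
        with self._lock:
            self._flags[name] = enabled

    def enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code ships dark.
        with self._lock:
            return self._flags.get(name, False)


FLAGS = FlagStore()


def legacy_quota_check(policy: dict) -> bool:
    return policy.get("used", 0) < policy.get("limit", 0)


def new_quota_check(policy: dict) -> bool:
    # Stand-in for the new logic; it raises on a blank "extra" field.
    return policy["extra"]["burst"] + policy["limit"] > policy["used"]


def quota_check(policy: dict) -> bool:
    if FLAGS.enabled("new_quota_policy_checks"):
        try:
            return new_quota_check(policy)
        except (KeyError, TypeError):
            # Error handling: fall back to the old path instead of crashing.
            return legacy_quota_check(policy)
    return legacy_quota_check(policy)
```

The flag decides whether the new path runs at all, and the `except` branch decides what happens when it runs and meets data it did not expect. Either one alone would have turned the blank fields into a degraded check instead of a crash.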

The herd effect: a recovery anti-pattern with a known fix
The herd effect that prolonged the us-central1 outage is not a new problem. It has been documented since the earliest days of distributed systems: when many clients restart simultaneously after a shared dependency recovers, they all connect at once and overwhelm the dependency, preventing it from returning to steady state. The canonical solution — randomised exponential backoff — is equally well-documented and simple: when restarting, add a random delay so clients stagger their reconnection attempts over a time window rather than clustering them at a single instant. Every Service Control instance waiting exactly zero milliseconds before hitting Spanner is the problem. Instances waiting a random delay between 0 and 30 seconds is the solution. Google committed to implementing this. The fact that it required an outage to prompt the implementation is a reminder that known fixes for known problems often go unimplemented until the cost is paid in production.
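
The staggered restart described here can be sketched as follows. This uses the "full jitter" variant of randomised exponential backoff, plus a random startup delay of up to 30 seconds as in the paragraph above. The function names, the `ConnectionError` failure signal, and the default parameters are assumptions for illustration, not details of Service Control.

```python
import random
import time


def backoff_delays(max_attempts, base=0.5, cap=30.0, rng=random):
    """Yield 'full jitter' delays: uniform in [0, min(cap, base * 2**attempt)]."""
    for attempt in range(max_attempts):
        yield rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def load_with_backoff(load, max_attempts=8, startup_jitter=30.0,
                      sleep=time.sleep, rng=random):
    """Call load() until it succeeds, spreading attempts out in time."""
    # Stagger the very first request so restarted instances do not all
    # hit the shared dependency at the same instant.
    sleep(rng.uniform(0.0, startup_jitter))
    last_error = None
    for delay in backoff_delays(max_attempts, rng=rng):
        try:
            return load()
        except ConnectionError as err:
            last_error = err
            sleep(delay)  # wait a random, growing interval before retrying
    raise RuntimeError("dependency still unavailable") from last_error
```

The randomness matters as much as the exponential growth: a fixed retry schedule keeps instances synchronised, so they would simply arrive together again a little later.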

The status page that went dark
Google's Cloud Service Health dashboard went offline during the June 12 outage because the status infrastructure shared a dependency on the same Google Cloud services that were failing. A status page that fails during a widespread outage is not just unhelpful — it is actively harmful. Customers experiencing failures couldn't access the standard channel to confirm they weren't the source of the problem, couldn't track recovery progress, and couldn't communicate accurate information to their own stakeholders. The status page being down created a second outage: an outage of information. A status page that goes down during the incident it's supposed to report is a monitoring anti-pattern at its most consequential.


Architecture

Service Control sits at the intersection of every API request Google Cloud processes. Understanding how it failed — and why the failure spread so quickly and recovered so slowly — requires understanding three things: the role of Spanner as the global policy data store, the absence of safe failure handling in the new code path, and the herd effect as a predictable consequence of synchronised restart under load.

The blast radius of the June 12 outage had three concentric rings:

| Failure Ring | What Failed | Why |
| --- | --- | --- |
| First: Google's own infrastructure | Cloud IAM, Compute Engine, Cloud Storage, BigQuery, Cloud SQL, Vertex AI, Cloud Monitoring, Google Workspace | Service Control crashed globally, blocking all API authorisation |
| Second: Direct GCP customers | Spotify (~46K reports), Snapchat, Fitbit, Replit, GitLab, Shopify, Character.AI, Cursor | Applications on GCP couldn't authorise any backend calls, so services appeared down even though underlying compute was running |
| Third: Cloudflare and its customers | Cloudflare (partial), Discord, Twitch | Cloudflare uses Google Cloud for certain backend operations; those degraded, cascading to Cloudflare's own customers |

[Diagram: Normal Flow vs June 12 Failure: What Service Control Does on Every Request. Interactive version available on TechLogStack.]

[Diagram: The Herd Effect: Why us-central1 Recovery Took 2+ Extra Hours. Interactive version available on TechLogStack.]


The Global Spanner Replication Trap

The June 12 failure was global rather than regional because Spanner's design strength worked against Google. Spanner is engineered to replicate data to all regions in real time, typically within seconds. When the automated system inserted the policy update with blank fields into the regional Spanner tables, that data reached every region almost immediately, and every Service Control instance hit the null pointer at essentially the same moment. There was no regional staging, no propagation delay, no opportunity for an alert to fire in one region before the failure had spread to all others. The same architecture that gives Spanner its global consistency guarantee gave this bug its global blast radius.
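
One way to restore the missing propagation delay is to treat data changes like code changes: validate them, then apply them in waves and stop at the first sign of trouble. The sketch below is a hypothetical illustration of that idea; `validate`, `staged_rollout`, and the `apply` and `healthy` callbacks are invented names and do not describe how Spanner or Service Control work.

```python
def validate(policy: dict) -> bool:
    # Hypothetical check: every required field must be present and non-blank.
    return all(policy.get(field) not in (None, "") for field in ("id", "limit"))


def staged_rollout(policy, regions, apply, healthy, waves=(1, 2)):
    """Apply a policy wave by wave; stop at the first unhealthy wave.

    waves gives the sizes of the early waves; the final wave takes the rest.
    Returns the list of regions that received the change.
    """
    if not validate(policy):
        return []  # reject before any region sees the change
    applied, start = [], 0
    sizes = list(waves) + [len(regions)]
    for size in sizes:
        wave = regions[start:start + size]
        start += size
        for region in wave:
            apply(region, policy)
            applied.append(region)
        # A real system would also wait ("bake") here before checking health.
        if not all(healthy(region) for region in wave):
            return applied  # halt: later regions are untouched
        if start >= len(regions):
            break
    return applied
```

With this shape, a policy update with blank fields is either rejected up front or, if it slips past validation, damages one region while the rest keep serving.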


Lessons

  1. Error handling is not optional for code that runs in the critical path of a globally distributed system. The null pointer exception that crashed Service Control was caused by a missing null check. Any code path that processes external data (here, data written by an automated system that could contain unexpected values) must explicitly handle the unexpected cases. Blank fields in policy metadata are a predictable input variation. The code should have anticipated them.

  2. Feature flags on infrastructure code are not optional. A feature flag deploys new code in an inactive state until it is explicitly enabled via configuration, so a team can disable a problematic feature instantly without redeployment. That makes flags the minimum viable safety mechanism for any code that processes global-scale policy data. The difference between "feature flag enabled, issue caught in staging" and "no feature flag, 7-hour global outage" is one line of configuration.

  3. The thundering herd is a known failure mode with a known fix: randomised exponential backoff. It occurs when many clients simultaneously attempt to reconnect to a shared resource after it recovers, overwhelming it and preventing it from returning to stable operation. Build randomised backoff into any service that must reconnect to a shared dependency after a failure. The fix has been documented for decades. That Service Control lacked it is a reminder that known fixes for known problems often go unimplemented until the cost of not implementing them is paid.

  4. Your monitoring infrastructure must be architecturally independent of the services it monitors. No shared dependencies between the monitoring stack and the application stack. The moment customers need status information most is exactly the moment a shared-dependency status page is most likely to be unavailable.

  5. Third-order cascade failures are invisible until they happen. Discord's users had no idea their outage originated in a null pointer in Google's quota management code. The dependency chain was opaque: Discord → Cloudflare → Google Cloud → Service Control → policy metadata blank fields. Every engineering team should map their dependency chain at least two levels deep — not just "we use Cloudflare" but "Cloudflare uses Google Cloud, and a Google Cloud outage of sufficient scope will reach us through Cloudflare."
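
Lessons 4 and 5 both come down to walking a dependency graph. The sketch below uses a deliberately simplified, hypothetical set of edges based on the chain described above; real dependency data would come from an architecture inventory, not a hard-coded dictionary.

```python
# Hypothetical, simplified dependency edges based on the chain described above.
DEPENDS_ON = {
    "Discord": ["Cloudflare"],
    "Cloudflare": ["Google Cloud"],
    "Google Cloud": ["Service Control"],
    "Service Control": ["Spanner policy metadata"],
    "Status page": ["Google Cloud"],
}


def transitive_deps(service, graph=DEPENDS_ON):
    """Return every direct and indirect dependency of a service."""
    seen, stack = set(), list(graph.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def shared_fate(monitor, monitored, graph=DEPENDS_ON):
    """Components whose failure would take down the monitor and its target."""
    target_side = {monitored} | transitive_deps(monitored, graph)
    return transitive_deps(monitor, graph) & target_side
```

In this toy graph, `transitive_deps("Discord")` includes Service Control even though Discord never calls it directly, and `shared_fate("Status page", "Google Cloud")` is non-empty, which is exactly the condition lesson 4 says to eliminate.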


Engineering Glossary

Dormant trap — a bug present in production code that cannot be triggered by any input present at deployment time, but activates when a specific trigger arrives later from an unrelated system. The May 29 Service Control change was a dormant trap: it executed correctly for 14 days until the automated policy update inserted blank fields. Staged rollouts are structurally unable to catch dormant traps.

Feature flag (kill switch) — a configuration switch that enables or disables a code path without requiring redeployment. The absent safeguard in this incident. A feature flag on the new quota checking code path would have allowed it to be disabled across all regions within seconds when null pointer exceptions began firing.

Herd effect (thundering herd) — a distributed systems failure mode where many clients simultaneously attempt to reconnect to a shared resource after it recovers, overwhelming the resource and preventing it from returning to stable operation. The mechanism that extended the us-central1 outage by 2+ hours after the red button was deployed.

Null pointer exception — a runtime error that occurs when code attempts to use a reference that points to no object (null). The missing null check in Service Control's new quota checking code that caused a 7-hour global outage for 50+ cloud services.

Randomised exponential backoff — a retry strategy where clients wait a random delay that increases exponentially with each retry attempt. The standard solution to the thundering herd problem — prevents synchronised reconnection bursts by distributing client attempts across a time window.

Service Control — Google's internal authorisation gateway that processes every API request across Google Cloud and Google Workspace. Performs authentication, authorisation, and quota enforcement on every call. A crash in Service Control takes down all API authorisation for the entire platform — making it the highest-blast-radius single component in Google Cloud.

Spanner — Google's globally distributed database, engineered to replicate data to all regions in real time (typically within seconds). Used by Service Control for policy metadata. The same replication speed that makes Spanner powerful for global consistency made this bug's blast radius global and instantaneous.


This case is a plain-English retelling of publicly available engineering material.

