- June 12, 2025 — 7+ hour outage; North America, Europe, Far East, Africa simultaneously
- Root cause: null pointer exception in Service Control from a May 29 code change — dormant for 14 days
- No feature flag, no error handling on the new code path
- 50+ Google Cloud services affected: IAM, Compute Engine, Cloud Storage, BigQuery, Vertex AI, Google Workspace
- Third-order cascade: Google → Cloudflare → Discord/Twitch; Discord users had no idea why they were down
- Herd effect during recovery overwhelmed Spanner in us-central1, extending the outage by 2+ hours
On May 29, 2025, a Google engineer deployed new quota-checking code to Service Control, the system that authorises every API request across Google Cloud. The code had a bug: it couldn't handle a null value. The bug stayed invisible during deployment because only specific policy data could trigger it, and that data hadn't appeared yet. Two weeks later, an automated system pushed a routine policy update containing blank fields. The policy data replicated globally within seconds. Every Service Control binary in every region hit the null pointer, crashed, and failed to restart cleanly. Spotify went down. Discord went down. Snapchat went down. Google's own status page went down. And when engineers deployed the fix, the restart surge overwhelmed Spanner in us-central1, stretching recovery there by more than two hours.
The Story
On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.
— Google Cloud, Official Incident Report, June 14, 2025
Service Control is not a product you've heard of. It doesn't have a marketing page or a conference talk. It exists in the infrastructure layer beneath everything else — the system that authorises every API request across Google Cloud and Google Workspace before that request is allowed to proceed. If you call the Cloud Storage API, Service Control checks your quota. If you authenticate with Google IAM, Service Control validates your policy. If your app on Google Cloud makes any call to any Google service, Service Control is in the critical path. It is, in the most literal sense, the gatekeeper of the entire platform.
When Service Control crashed on June 12, it didn't just take down one service. It took down the authorisation layer for every service. API calls returned 503 errors not because the underlying services had failed, but because the gatekeeper wasn't there to let them through. Compute Engine instances were running. Cloud Storage buckets were intact. BigQuery jobs were ready to execute. None of it mattered — because without Service Control, nothing could be authorised, and nothing unauthorised can proceed in a correctly secured cloud platform.
Problem
May 29: Code Deployed — Bug Present, But Invisible
Google engineers deployed new quota policy checking code to Service Control. The deployment went through the standard region-by-region rollout and passed all checks. But the new code path had two critical gaps: no error handling for null values, and no feature flag to disable it if something went wrong. The bug was invisible during rollout because the problematic code path could only be triggered by blank fields in the policy metadata. That input hadn't appeared during rollout. The binary was now running in every region with a loaded trap, waiting for the right trigger.
Cause
June 12, 10:45 AM PDT: The Policy Update That Pulled the Trigger
An automated system inserted a routine policy change into the regional Spanner tables that Service Control uses for policy metadata. The policy update contained unintended blank fields. Because quota management is global, Spanner's replication engine distributed this metadata worldwide within seconds. Every Service Control binary in every region hit the new code path, encountered the null values, and threw a null pointer exception. Without error handling, the exception crashed the binary. Service Control was dead globally.
Solution
The SRE Response: Diagnosis in 10 Minutes, Red Button in 40
Google's SRE team began triaging within two minutes of the first alert. They identified the root cause — the null pointer exception in the new quota checking code path — within 10 minutes. Engineers deployed a 'red button' kill switch within 40 minutes to disable the problematic serving path. Most regions began recovering within two hours.
Result
The Herd Effect: When Recovery Made Things Worse
As Service Control instances restarted in us-central1 after the red button was deployed, they all simultaneously reached for the regional Spanner database to load their policy metadata. Hundreds of instances, all restarting at the same moment, all hitting Spanner at once, with no randomisation in their startup sequence. Spanner was overwhelmed by the simultaneous burst. Service Control couldn't load its policies, couldn't restart properly, kept trying, kept hitting Spanner. The recovery created a herd effect that prolonged the outage in us-central1 by more than two hours beyond when other regions had stabilised. Full resolution wasn't complete until 18:18 PDT — more than seven hours after the incident began.
The Fix
Google's Response: Five Commitments After the Outage
Google's five-category post-incident remediation plan (from the official June 14, 2025 incident report):
| Failure Mode | What Happened | Google's Fix |
|---|---|---|
| Missing error handling | Null pointer exception crashed the binary when blank fields appeared | Mandatory null-safe code patterns with static analysis to catch null pointer vulnerabilities before deployment |
| No feature flag | New code path couldn't be disabled without full binary redeployment — adding 30+ min to response | Feature flag protection required for all new Service Control code paths |
| Herd effect during recovery | Hundreds of instances restarting simultaneously overwhelmed Spanner | Randomised exponential backoff on Service Control startup |
| Status page availability | Cloud Service Health dashboard went down during the outage | Decouple status infrastructure from the services it monitors |
| Service Control architecture | Monolithic binary — crash in quota logic crashes all authorisation | Modularise Service Control — isolate quota checking from authentication |
- 10 min — time for Google's SRE team to identify the root cause from the first alert at 10:49 AM PDT
- 40 min — time to deploy the red button kill switch that disabled the problematic code path
- 7+ hrs — total outage duration; most regions recovered in ~2 hours, herd effect in us-central1 extended full resolution
- 50+ — Google Cloud services affected, including all core infrastructure APIs, all Google Workspace products, and all AI/ML services
The dormant trap pattern that caused this outage is worth naming explicitly. Google's staged, region-by-region rollout is exactly the right practice for catching bugs introduced by new deployments, and it did its job: no failures appeared during the May 29 rollout, or for 14 days afterwards, because the failure condition required specific policy data (blank fields) that had not yet been inserted. Staged rollouts are structurally unable to catch dormant traps: bugs that only activate when a specific trigger arrives weeks later from an unrelated automated system. The only defences against dormant traps are error handling (so the crash doesn't happen when the trigger arrives) and feature flags (so the code path can be disabled immediately when the trigger produces unexpected behaviour). The May 29 change had neither.
The herd effect: a recovery anti-pattern with a known fix
The herd effect that prolonged the us-central1 outage is not a new problem. It has been documented since the earliest days of distributed systems: when many clients restart simultaneously after a shared dependency recovers, they all connect at once and overwhelm the dependency, preventing it from returning to steady state. The canonical solution, randomised exponential backoff, is equally well documented and simple: each client waits a random delay before connecting and lengthens that delay after every failed attempt, so reconnections spread over a time window instead of clustering at a single instant. Every Service Control instance waiting exactly zero milliseconds before hitting Spanner is the problem. Each instance waiting a random delay of, say, 0 to 30 seconds is the solution. Google committed to implementing this. That it took an outage to prompt the implementation is a reminder that known fixes for known problems often go unimplemented until the cost is paid in production.
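The staggered restart described above can be sketched in a few lines. This is a minimal illustration, not Google's implementation: the function names, the 0.5-second base, the 30-second cap, and the use of `ConnectionError` to signal an overloaded dependency are all assumptions made for the example. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    # Exponential ceiling: base, 2*base, 4*base, ... capped at `cap` seconds.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] so instances spread out.
    return rng.uniform(0.0, ceiling)


def load_with_backoff(load, max_attempts=8, sleep=time.sleep, rng=random):
    # `load` is a hypothetical callable that fetches policy metadata and
    # raises ConnectionError when the shared dependency is overloaded.
    # A random initial delay staggers instances that all restart at once.
    sleep(backoff_delay(0, rng=rng))
    for attempt in range(max_attempts):
        try:
            return load()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(backoff_delay(attempt + 1, rng=rng))
```

Passing `sleep` and `rng` as parameters keeps the retry loop testable without real waiting. The cap matters as much as the jitter: without it, later retries could wait for minutes.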
The status page that went dark
Google's Cloud Service Health dashboard went offline during the June 12 outage because the status infrastructure shared a dependency on the same Google Cloud services that were failing. A status page that fails during a widespread outage is not just unhelpful — it is actively harmful. Customers experiencing failures couldn't access the standard channel to confirm they weren't the source of the problem, couldn't track recovery progress, and couldn't communicate accurate information to their own stakeholders. The status page being down created a second outage: an outage of information. A status page that goes down during the incident it's supposed to report is a monitoring anti-pattern at its most consequential.
Architecture
Service Control sits at the intersection of every API request Google Cloud processes. Understanding how it failed — and why the failure spread so quickly and recovered so slowly — requires understanding three things: the role of Spanner as the global policy data store, the absence of safe failure handling in the new code path, and the herd effect as a predictable consequence of synchronised restart under load.
The blast radius of the June 12 outage had three concentric rings:
| Failure Ring | What Failed | Why |
|---|---|---|
| First: Google's own infrastructure | Cloud IAM, Compute Engine, Cloud Storage, BigQuery, Cloud SQL, Vertex AI, Cloud Monitoring, Google Workspace | Service Control crashed globally, blocking all API authorisation |
| Second: Direct GCP customers | Spotify (~46K reports), Snapchat, Fitbit, Replit, GitLab, Shopify, Character.AI, Cursor | Applications on GCP couldn't authorise any backend calls — services appeared down even though underlying compute was running |
| Third: Cloudflare and its customers | Cloudflare (partial), Discord, Twitch | Cloudflare uses Google Cloud for certain backend operations; those degraded, cascading to Cloudflare's own customers |
Normal Flow vs June 12 Failure: What Service Control Does on Every Request
Interactive diagram available on TechLogStack.
The Herd Effect: Why us-central1 Recovery Took 2+ Extra Hours
Interactive diagram available on TechLogStack.
Lessons
Error handling is not optional for code that runs in the critical path of a globally distributed system. The null pointer exception that crashed Service Control was caused by a missing null check. Any code path that processes external data (data that arrives from an automated system and could contain unexpected values) must explicitly handle the unexpected cases. Blank fields in policy metadata are a predictable input variation. The code should have anticipated them.
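As a rough illustration of what handling the unexpected case can look like, the sketch below checks a quota policy whose fields may be blank. Everything here is hypothetical: the `QuotaPolicy` shape, the field names, and the logger name are invented for the example, since the real policy schema is not public. Falling back to "allow" on malformed data is also an assumption; a different system might reasonably choose to deny instead.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("quota")


@dataclass
class QuotaPolicy:
    # Hypothetical policy record; either field may arrive blank.
    project: Optional[str] = None
    limit_per_minute: Optional[int] = None


def check_quota(policy: Optional[QuotaPolicy], used: int) -> bool:
    """Return True if the request may proceed."""
    if policy is None or policy.limit_per_minute is None:
        # Blank policy data: log it and fall back instead of crashing.
        # Failing open is an assumption of this sketch, not a universal rule.
        logger.warning("malformed quota policy: %r", policy)
        return True
    return used < policy.limit_per_minute
```

The point is less the specific fallback than the fact that the blank case has an explicit branch, so bad data degrades one decision and does not crash the process.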
Feature flags (a software engineering practice where new code is deployed but kept inactive until explicitly enabled via configuration, allowing teams to disable problematic features instantly without redeployment) on infrastructure code are not optional — they are the minimum viable safety mechanism for any code that processes global-scale policy data. The difference between "feature flag enabled, issue caught in staging" and "no feature flag, 7-hour global outage" is one line of configuration.
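A flag-gated code path might look like the following sketch. The `flags` mapping stands in for whatever dynamic configuration store a real system would consult, and `legacy_checks`, `new_quota_checks`, and the flag name are hypothetical stand-ins, not Service Control internals. The new check is deliberately fragile so the effect of the flag is visible.

```python
def flag_enabled(flags, name, default=False):
    # Unknown or missing flags fall back to the safe default (off).
    return flags.get(name, default) is True


def legacy_checks(request):
    # Stand-in for the existing, well-exercised authorisation path.
    return bool(request.get("authenticated"))


def new_quota_checks(request):
    # Deliberately fragile stand-in: raises TypeError when the policy is
    # None, mimicking a missing null check.
    return request["used"] < request["policy"]["limit"]


def authorise(request, flags):
    if not legacy_checks(request):
        return False
    # The new path only runs while the flag is on; switching it off
    # restores the old behaviour without a redeployment.
    if flag_enabled(flags, "new_quota_checks"):
        return new_quota_checks(request)
    return True
```

With the flag off, a request carrying a blank policy is still authorised by the legacy path. With the flag on, the same request raises, which is the situation a flag lets operators escape by flipping one value.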
The thundering herd (a distributed systems failure mode where many clients simultaneously attempt to reconnect to a shared resource after it recovers, overwhelming it and preventing it from returning to stable operation) is a known failure mode with a known fix: randomised exponential backoff. Build randomised backoff into any service that has a shared dependency it needs to reconnect to after a failure. This has been documented for decades. The fact that Service Control lacked it is a reminder that known fixes for known problems often go unimplemented until the cost of not implementing them is paid.
Your monitoring infrastructure must be architecturally independent of the services it monitors. No shared dependencies between the monitoring stack and the application stack. The moment customers need status information most is exactly the moment a shared-dependency status page is most likely to be unavailable.
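One way to make this rule checkable is to compute the transitive dependencies of the monitoring stack and of the service it reports on, then look at the overlap. The sketch below assumes a dependency graph is available as a plain dictionary; the component names in the usage example are invented for illustration.

```python
def transitive_deps(graph, root):
    # Collect everything `root` depends on, directly or indirectly.
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def shared_dependencies(graph, monitor, service):
    # Any component in this set can take down both at the same time.
    return transitive_deps(graph, monitor) & transitive_deps(graph, service)


# Hypothetical graph: the status page and the API share a gateway.
example_graph = {
    "status-page": ["cdn", "api-gateway"],
    "storage-api": ["api-gateway", "database"],
    "api-gateway": ["auth-layer"],
}
overlap = shared_dependencies(example_graph, "status-page", "storage-api")
```

An empty result is the goal. Here `overlap` contains both `api-gateway` and `auth-layer`, so a failure in either would silence the status page at the moment it is needed.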
Third-order cascade failures are invisible until they happen. Discord's users had no idea their outage originated in a null pointer in Google's quota management code. The dependency chain was opaque: Discord → Cloudflare → Google Cloud → Service Control → policy metadata blank fields. Every engineering team should map their dependency chain at least two levels deep — not just "we use Cloudflare" but "Cloudflare uses Google Cloud, and a Google Cloud outage of sufficient scope will reach us through Cloudflare."
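Mapping such a chain can start very small. The sketch below lists every path from a service to an upstream component in a dependency graph. The graph encodes only the chain named above; how a team would actually gather its vendors' dependencies is outside the scope of the example, and the function name is hypothetical.

```python
def dependency_chains(graph, start, target, path=None):
    # Return every dependency path from `start` to `target`.
    path = (path or []) + [start]
    if start == target:
        return [path]
    chains = []
    for dep in graph.get(start, ()):
        if dep not in path:  # skip cycles
            chains.extend(dependency_chains(graph, dep, target, path))
    return chains


chain_graph = {
    "Discord": ["Cloudflare"],
    "Cloudflare": ["Google Cloud"],
    "Google Cloud": ["Service Control"],
}
chains = dependency_chains(chain_graph, "Discord", "Service Control")
```

For this graph, `chains` holds a single four-element path. A real map would have many branches, and the length of each path shows how many organisations sit between a service and the component that can take it down.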
Engineering Glossary
Dormant trap — a bug present in production code that cannot be triggered by any input present at deployment time, but activates when a specific trigger arrives later from an unrelated system. The May 29 Service Control change was a dormant trap: it executed correctly for 14 days until the automated policy update inserted blank fields. Staged rollouts are structurally unable to catch dormant traps.
Feature flag (kill switch) — a configuration switch that enables or disables a code path without requiring redeployment. The absent safeguard in this incident. A feature flag on the new quota checking code path would have allowed it to be disabled across all regions within seconds when null pointer exceptions began firing.
Herd effect (thundering herd) — a distributed systems failure mode where many clients simultaneously attempt to reconnect to a shared resource after it recovers, overwhelming the resource and preventing it from returning to stable operation. The mechanism that extended the us-central1 outage by 2+ hours after the red button was deployed.
Null pointer exception — a runtime error that occurs when code attempts to use a reference that points to no object (null). The missing null check in Service Control's new quota checking code that caused a 7-hour global outage for 50+ cloud services.
Randomised exponential backoff — a retry strategy where clients wait a random delay that increases exponentially with each retry attempt. The standard solution to the thundering herd problem — prevents synchronised reconnection bursts by distributing client attempts across a time window.
Service Control — Google's internal authorisation gateway that processes every API request across Google Cloud and Google Workspace. Performs authentication, authorisation, and quota enforcement on every call. A crash in Service Control takes down all API authorisation for the entire platform — making it the highest-blast-radius single component in Google Cloud.
Spanner — Google's globally distributed database, engineered to replicate data to all regions in real time (typically within seconds). Used by Service Control for policy metadata. The same replication speed that makes Spanner powerful for global consistency made this bug's blast radius global and instantaneous.
This case is a plain-English retelling of publicly available engineering material.