TechLogStack

Originally published at techlogstack.com

Spotify Changed a Filter Order in Their Proxy — Then Every Server in the World Crashed at Once

Spotify · Reliability · 24 May 2026

On April 16, 2025, Spotify's engineering team made a change they deemed low risk: reordering the custom filters inside their Envoy Proxy perimeter. They applied it to all regions simultaneously. Every Envoy instance worldwide crashed at once, and alarms fired two minutes later. Then the restart loop began, a loop Kubernetes itself was powering, killing each new server as fast as it came back up. Much of Spotify's 675-million-user base couldn't load the app. Asia Pacific stayed up, and the reason why told the engineers exactly what was broken.

  • 12:18–15:45 UTC (3h 27min)
  • 675M MAU affected
  • 48,000+ Downdetector peak reports
  • All regions except Asia Pacific down
  • Envoy heap > K8s limit misconfiguration
  • Fixed: capacity increase broke the death loop

The Story

This crash happened simultaneously on all Envoy instances.

Spotify Engineering, "Incident Report: Spotify Outage on April 16, 2025", engineering.atspotify.com

There is a specific kind of engineering failure that hurts more than the others: the change that was reviewed, discussed, and approved, the change the team looked at together and agreed was fine. On April 16, 2025, Spotify's team reordered the custom filters within their Envoy Proxy perimeter. Envoy is an open-source, high-performance edge proxy originally built at Lyft and now widely used as the networking perimeter layer in distributed systems; it receives all incoming user traffic before distributing it to backend services. This was not a new feature, not a database migration, not a major infrastructure overhaul. It was a filter reorder. The team assessed it as low risk. They applied it to all cloud regions simultaneously. Every single Envoy instance running Spotify's networking perimeter crashed at the same moment, and two minutes later the alarms fired.

Spotify's perimeter is the first layer of software that receives traffic from every user worldwide: every stream request, every search, every login. It sits in front of all backend services and distributes traffic across cloud regions. To extend Envoy's capabilities, Spotify develops and maintains its own custom filters, plugins that run within Envoy to handle rate limiting, authentication, and other cross-cutting concerns. These filters execute in a defined order. The April 16 change altered that order. The new sequence triggered a latent bug in one of the custom filters: a code path that had existed undetected, harmless as long as the filter never received control at that position, suddenly activated. Envoy crashed. Not one instance, not one region. All of them.

THE DEATH LOOP: WHY THE RESTART MADE THINGS WORSE

An Envoy crash is normally survivable: Kubernetes detects the failed pod and starts a replacement. But what happened next on April 16 was not normal. The immediate restart of all Envoy instances, combined with client-side retry logic (every user's app and browser retrying its failed request), created an unprecedented traffic spike onto the new instances. Each new Envoy started, received the full flood of retry traffic, and consumed more memory than its Kubernetes memory limit allowed. That limit is the maximum memory a container may use, defined in the pod's resource spec; a container that exceeds it is terminated regardless of what it is doing. So each new instance was killed automatically. A new instance started. The same thing happened. The loop repeated, powered by Kubernetes itself, for hours.

Problem

12:18 UTC — Filter Reorder Applied Globally, All Envoy Instances Crash

The change to Envoy filter execution order was applied simultaneously to all cloud regions worldwide. The new order activated a latent bug in a custom Spotify filter. Every Envoy instance on Spotify's networking perimeter crashed at the same moment. Alarms fired two minutes later as the traffic drop became measurable.


Cause

The Hidden Misconfiguration: Heap Larger Than the K8s Memory Limit

The traffic flood from client retries exposed a misconfiguration that had existed undetected: Envoy's max heap size was configured higher than the Kubernetes memory limit for the pod. Under normal traffic, Envoy never approached its heap limit and the misconfiguration was invisible. Under the retry flood, each new instance immediately exceeded the K8s limit and was killed. This turned a crash into an infinite restart loop.


Solution

Asia Pacific Stayed Up — and Explained Everything

Asia Pacific was the only region unaffected. Engineers noticed and investigated why. The answer: lower traffic volume at that time of day (timezone difference) meant APAC Envoy instances never received enough retry traffic to exceed the K8s memory limit. The asymmetry proved the hypothesis: the death loop was memory-limit driven, not bug-driven. Fix the memory headroom, break the loop.


Result

15:45 UTC — Death Loop Broken, Full Recovery

Increasing total perimeter server capacity gave each new Envoy instance enough headroom to stay under the K8s memory limit even while absorbing the retry traffic flood. The death loop broke. Instances stabilized. EU recovered at 14:20 UTC and US at 15:10 UTC; traffic patterns were back to normal globally by 15:40 UTC, and the incident window closed at 15:45 UTC. Total duration from the 12:18 UTC change: 3 hours 27 minutes.


🌏

Asia Pacific's survival was not a result of better engineering in that region. It was a result of time zones. The outage struck at 12:18 UTC: morning in the US and early afternoon in Europe, but evening to late night across Asia Pacific, where traffic was naturally lower. Fewer users → fewer retries → less memory pressure on new Envoy instances → stayed under the K8s limit. The region that wasn't affected was the one that happened to have the least traffic at the exact wrong moment.

⚠️

Why 'Low Risk' Was Wrong

The filter reorder was assessed as low risk for a defensible reason: reordering filters does not add new code or change individual filter logic. It changes the sequence in which existing, tested filters run. The team's mental model was correct for most cases — but it was missing one scenario: a latent bug in a filter that only activates when that filter receives control at a specific position in the execution chain. Latent bugs that depend on execution context are invisible to tests that don't vary that context. A filter integration test suite that exercises filters in isolation or in their original order will never catch a bug that only manifests in a new order.

The mechanism of the death loop is worth understanding in precise detail because it recurs across infrastructure outages in different forms. The pattern is: a failure event triggers a restart; the restart environment differs from steady state (here, a retry flood instead of normal traffic); the restarted instance fails faster than it would under steady state; and the restart mechanism itself (Kubernetes) becomes the engine of the failure. In Kubernetes deployments, the most common version of this pattern is the OOMKill cycle. An OOMKill (out-of-memory kill) terminates a container that exceeds its configured memory limit, which protects other workloads on the node from memory starvation. The cycle runs like this: a pod exceeds its memory limit under unexpected load, it is killed, the replacement starts with no warm state and faces the same load, and it is killed again. The Spotify outage was this pattern at global perimeter scale.

ℹ️

EnvoyCon 2025: The Custom Filter Spotify Presented

Spotify's engineering team had discussed their custom Envoy filter work publicly, including their rate limiting filter, at EnvoyCon 2025, just before the April 16 incident. The presentation described the same filter system that the April 16 change modified. The public talk was about the filter's capabilities; the April 16 postmortem was about what happened when its execution order changed. The two documents together give a rare complete picture: how the system was designed to work, and exactly how it broke.

The 263 million Spotify Premium subscribers who pay for the service experienced the same outage as the 412 million free-tier users. Spotify's architecture does not provide differentiated reliability between paid and unpaid users at the perimeter layer — the same Envoy proxy handles all traffic. This is consistent with how most streaming platforms operate: the perimeter is a shared resource, and a perimeter failure is total. The 48,000+ Downdetector reports at peak represented the fraction of users actively reporting issues; the actual count of affected users was in the hundreds of millions.

🔄

The Retry Amplification Problem

Client-side retry logic is a standard reliability feature: when a request fails, the app retries it, giving transient failures a chance to self-heal. During normal partial failures, retry logic helps. During a simultaneous total failure, retry logic becomes a load amplifier. Every user whose request failed immediately retried — some apps retry multiple times with exponential backoff. The simultaneous crash of all Envoy instances converted the normal traffic level into a retry-amplified spike: each failed request generated one or more retry requests, all arriving at the same moment the replacement instances were starting. The retry logic designed to improve reliability became a key component of the death loop.
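The amplification can be put into rough numbers. The sketch below is a simplified model, not Spotify's client behavior: it assumes every failed attempt is retried until a fixed retry count is used up, and it ignores how backoff spreads those retries over time. The function name and all parameters are hypothetical.

```python
def offered_load(base_rps: float, max_retries: int, failure_rate: float) -> float:
    """Requests per second reaching the perimeter, counting retries.

    Each attempt fails with probability `failure_rate`; a failed attempt is
    retried until `max_retries` retries have been spent.
    """
    load = 0.0
    attempt_share = 1.0  # fraction of original requests that reach this attempt
    for _ in range(max_retries + 1):
        load += base_rps * attempt_share
        attempt_share *= failure_rate
    return load


# Healthy perimeter: retries add nothing.
print(offered_load(1000, max_retries=3, failure_rate=0.0))
# Total failure: every request is sent 1 + max_retries times.
print(offered_load(1000, max_retries=3, failure_rate=1.0))
```

Under this model a total failure multiplies offered load by `1 + max_retries`, which is why a retry policy that is harmless during partial failures becomes a load multiplier when everything is down at once.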

263 Million Premium Subscribers: No Differentiation

Spotify's perimeter architecture treats all traffic identically — there is no fast lane for paid subscribers at the proxy layer. The 263 million Premium users who pay for an ad-free, uninterrupted experience were indistinguishable from the 412 million free-tier users when the perimeter crashed. This is not a design flaw; building separate perimeter infrastructure for each tier would add enormous complexity. But it means that the reliability guarantee implicit in a Premium subscription depends entirely on the reliability of shared perimeter infrastructure. When the perimeter fails totally, the premium experience fails identically to the free experience.


The Fix

The Misconfiguration Nobody Noticed — Until the Crash

The fix for the death loop was increasing perimeter server capacity — but this addressed the symptom, not the underlying misconfiguration. The root problem was that Envoy's max heap size was set higher than the Kubernetes memory limit for the pod. In normal operation, Envoy memory usage never approached its heap maximum. The misconfiguration was invisible: Envoy wasn't crashing, K8s wasn't killing pods, monitoring wasn't alerting. The gap between heap size and K8s limit existed in the configuration for an unknown period before April 16 — it simply never mattered because Envoy memory never climbed high enough to expose it. The retry flood was the first event extreme enough to push instances over the K8s limit and trigger the kill cycle.

  • 3h 27m: total outage duration from the first Envoy crash (12:18 UTC) to full global normalization (15:45 UTC), spanning the North American morning and the European afternoon
  • 675M: Spotify monthly active users worldwide, 263M of them paying Premium subscribers; outside Asia Pacific they saw degraded service or complete unavailability during the incident
  • 48,000+: peak Downdetector reports, counting active user reports only; actual affected users numbered in the hundreds of millions globally, excluding Asia Pacific
  • 0: regions with a staged rollout before full deployment; the filter reorder was applied globally and simultaneously because it was assessed as low risk, removing the safety net of incremental validation
# THE MISCONFIGURATION: Envoy heap limit higher than K8s memory limit
# This created a hidden gap that was invisible until the retry flood

# Kubernetes pod resource specification (simplified)
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: envoy
    resources:
      requests:
        memory: "2Gi"
      limits:
        memory: "3Gi" # K8s will OOMKill the pod above this

# Envoy overload manager configuration (simplified)
overload_manager:
  actions:
  - name: envoy.overload_actions.stop_accepting_requests
    triggers:
    - name: envoy.resource_monitors.fixed_heap
      threshold:
        value: 0.95 # Envoy's own heap limit: 95% of max heap
  resource_monitors:
  - name: envoy.resource_monitors.fixed_heap
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
      max_heap_size_bytes: 4294967296 # 4 GiB: HIGHER than the 3 GiB K8s limit!

# Result:
# - K8s kills at 3 GiB of memory usage
# - Envoy's own safety valve triggers at 95% of 4 GiB = 3.8 GiB
# - K8s limit is hit BEFORE Envoy's own graceful degradation kicks in
# - Under normal load: Envoy peaks at ~1.5 GiB, so the misconfiguration is invisible
# - Under retry flood: Envoy climbs past 3 GiB → K8s OOMKills → restart
# → same flood → same OOMKill → infinite loop

# THE FIX (immediate): Increase perimeter server count
# More servers = same retry traffic spread across more instances
# = each instance stays under 3 GiB = K8s doesn't kill = loop breaks

# THE FIX (permanent): Align heap config with K8s memory limit
# max_heap_size_bytes: 2684354560 # 2.5 GiB: safely below the 3 GiB K8s limit

WHY INCREASING CAPACITY FIXED THE LOOP

The death loop's engine was the K8s OOMKill cycle. The K8s memory limit was fixed. The retry traffic load was fixed (determined by user behavior). The only variable Spotify could change quickly was the number of Envoy instances sharing that retry load. More instances → each instance receives a smaller share of the retry flood → each instance's memory usage stays lower → stays under the K8s limit → K8s doesn't kill it → stable. This is why increasing capacity broke the loop: it reduced per-instance memory pressure below the kill threshold. The underlying misconfiguration (heap > K8s limit) was fixed separately afterward as a permanent remediation.
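The capacity argument reduces to simple arithmetic. The sketch below assumes, purely for illustration, that per-instance memory grows linearly with the request rate each instance receives. The linear model, the function names, and every number are assumptions, not measurements from Spotify's perimeter.

```python
import math

GIB = 2**30
MIB = 2**20


def per_instance_memory(baseline_bytes, bytes_per_rps, total_rps, instances):
    """Estimated memory of one instance when load is spread evenly."""
    return baseline_bytes + bytes_per_rps * (total_rps / instances)


def min_instances(baseline_bytes, bytes_per_rps, total_rps, limit_bytes):
    """Smallest instance count that keeps every instance at or under the limit."""
    headroom = limit_bytes - baseline_bytes
    if headroom <= 0:
        raise ValueError("memory limit leaves no room above baseline usage")
    return math.ceil(bytes_per_rps * total_rps / headroom)


# Hypothetical retry flood: 400k req/s, 1 GiB baseline, 1 MiB per req/s, 3 GiB limit.
needed = min_instances(1 * GIB, 1 * MIB, 400_000, 3 * GIB)
print(needed)  # 196
```

With one instance fewer than `needed`, the same model puts each instance over the limit, which is the kill-and-restart regime described above. The limit and the flood stay fixed; only the divisor changes.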

ℹ️

Spotify's Four Post-Incident Commitments

Spotify's postmortem committed to four specific engineering changes: (1) Fix the Envoy filter bug that caused the initial crash on filter reorder. (2) Fix the configuration mismatch between Envoy heap size and Kubernetes memory limit. (3) Improve the rollout process for configuration changes to the perimeter — staged rather than global simultaneous. (4) Improve monitoring capabilities to detect these issues earlier in the failure chain. Notably, the postmortem linked directly to their EnvoyCon 2025 talk, showing transparency about which system was involved rather than obscuring the component.

What a Staged Rollout Would Have Caught

If the filter reorder had been applied to a single region first with a monitoring window before global rollout, the failure would have been a regional incident recoverable in minutes, not a global outage lasting 3.5 hours. The Envoy crash would have appeared in one region. Engineers would have investigated, found the latent bug, rolled back the filter order. Total blast radius: one region for ~10 minutes. The simultaneous global rollout removed this safety net entirely. The misconfiguration would still have existed — but would have been exposed only in one region rather than all at once.
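A staged rollout of this kind can be sketched as a loop with a health gate between regions. This is a minimal illustration, not Spotify's deployment tooling: `apply`, `healthy`, and `rollback` are hypothetical callbacks, and `healthy` is assumed to block for the monitoring window before returning.

```python
from typing import Callable, Iterable, List


def staged_rollout(regions: Iterable[str],
                   apply: Callable[[str], None],
                   healthy: Callable[[str], bool],
                   rollback: Callable[[str], None]) -> List[str]:
    """Apply a change one region at a time, stopping at the first failure.

    Returns the regions left running the change: all of them on success,
    none if any region failed its health check.
    """
    done: List[str] = []
    for region in regions:
        apply(region)
        if not healthy(region):
            # Undo the failing region first, then earlier ones in reverse order.
            for r in [region] + done[::-1]:
                rollback(r)
            return []
        done.append(region)
    return done
```

The point of the structure is that later regions never receive a change that an earlier region has already rejected, so the blast radius of a bad change is the first region that exposes it.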

Spotify Envoy Outage: Timeline of Events and Recovery Progression

| Time (UTC) | Event | Status |
| --- | --- | --- |
| 12:18 | Envoy filter order changed; all instances crash simultaneously | 🔴 Global failure begins |
| 12:20 | Alarms triggered on traffic drop; death loop already running | 🔴 Engineers paged |
| 12:28 | Situation escalated; only APAC serving traffic | 🔴 Incident declared |
| ~13:xx | Root cause identified via APAC asymmetry; capacity increase planned | 🟡 Diagnosis complete |
| 14:20 | EU regions fully recovered | 🟡 Partial recovery |
| 15:10 | US regions fully recovered | 🟡 Partial recovery |
| 15:40 | All traffic patterns normal globally | 🟢 Full recovery |

ℹ️

The Perimeter as a Shared Fate System

Spotify's Envoy perimeter is a shared fate system — all backend services rise and fall with it. Even if every backend service (streaming, search, auth, recommendations) remained perfectly healthy during the April 16 incident, users experienced total unavailability because the perimeter through which all requests flow was in a crash loop. This architectural property is why perimeter changes deserve the highest rollout rigor: a perimeter failure has a blast radius of every service, every user, every region simultaneously. The perimeter is where shared fate is most acute.


Architecture

Spotify's networking perimeter architecture places Envoy Proxy as the outermost layer — the first software that receives every user request, regardless of what feature or backend it is ultimately destined for. Envoy runs in every cloud region and distributes incoming traffic to the appropriate backend microservices. Custom filters extend Envoy's capabilities beyond the open-source defaults: rate limiting, authentication, request routing customization. Understanding the outage requires understanding that when every Envoy instance worldwide crashes simultaneously, no user request can reach any backend service — the entire platform goes dark regardless of whether individual backend services remain healthy.

Spotify's Perimeter Architecture: Envoy as the Universal Traffic Gateway

Interactive diagram available on TechLogStack.

The Three-Layer Failure Cascade: From Filter Bug to Death Loop

Interactive diagram available on TechLogStack.

THE ASIA PACIFIC DIAGNOSTIC: HOW ONE REGION PROVED THE ROOT CAUSE

The most important engineering insight in the April 16 incident was not the initial cause — it was using Asia Pacific's survival to prove the diagnosis. When engineers observed APAC was unaffected, they had two candidate hypotheses: (A) the filter bug is region-specific, or (B) the death loop is traffic-intensity dependent. If (A), APAC has different filter config. If (B), APAC has lower traffic at this hour. Investigation confirmed (B): APAC runs identical filter configuration. Lower traffic meant less retry amplification, meaning per-instance memory pressure never reached the K8s limit. This asymmetry transformed a hard debugging problem (why is the loop happening?) into a tractable one (what's different about APAC?) and pointed directly at the memory-limit misconfiguration.
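The "what's different about the survivor?" question can be asked mechanically. The sketch below compares a surviving region against the failed ones and keeps only the attributes on which the survivor differs from all of them. The region snapshots are hypothetical, with coarse labels standing in for real configuration and traffic data.

```python
from typing import Any, Dict, List


def distinguishing_attributes(survivor: Dict[str, Any],
                              failed: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Attributes on which the survivor differs from every failed region."""
    result = {}
    for key, value in survivor.items():
        if failed and all(region.get(key) != value for region in failed):
            result[key] = value
    return result


# Hypothetical snapshots taken during the incident.
apac = {"filter_order": "new", "max_heap_gib": 4, "k8s_limit_gib": 3, "traffic": "low"}
eu = {"filter_order": "new", "max_heap_gib": 4, "k8s_limit_gib": 3, "traffic": "high"}
us = {"filter_order": "new", "max_heap_gib": 4, "k8s_limit_gib": 3, "traffic": "high"}

print(distinguishing_attributes(apac, [eu, us]))  # {'traffic': 'low'}
```

An identical filter configuration rules out hypothesis (A); the one attribute left standing points at hypothesis (B).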

⚠️

Configuration Drift in Long-Running Systems

The Envoy heap/K8s limit misconfiguration almost certainly existed long before April 16, 2025. It was never caught because Envoy memory usage never reached the dangerous threshold under normal traffic. This is a common pattern: configuration mismatches that are only dangerous under abnormal load go undetected indefinitely in systems where abnormal load doesn't occur. The misconfiguration didn't cause the outage — the filter bug did. But the misconfiguration was what turned a recoverable crash into a multi-hour global outage. Auditing resource limit configurations against actual peak usage — including synthetic stress tests — is the practice that catches these time bombs before they detonate.
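An audit for this specific mismatch can be very small. The sketch below is a hypothetical check, not an existing tool: it parses a Kubernetes memory quantity (only a subset of the suffixes Kubernetes accepts) and compares it with an Envoy `max_heap_size_bytes` value and overload threshold. Heap is only part of a process's memory, so a real check would want extra margin beyond what is shown here.

```python
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
          "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_k8s_memory(quantity: str) -> int:
    """Convert a memory quantity such as "3Gi" or "512M" to bytes."""
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * _UNITS[suffix])
    return int(quantity)  # plain byte count


def heap_limit_problems(k8s_limit: str, max_heap_bytes: int,
                        overload_threshold: float) -> list:
    """Describe ways the heap settings can outrun the pod memory limit."""
    limit = parse_k8s_memory(k8s_limit)
    problems = []
    if max_heap_bytes >= limit:
        problems.append("max heap is not below the pod memory limit")
    if overload_threshold * max_heap_bytes >= limit:
        problems.append("overload action would trigger only after the OOM kill")
    return problems


print(heap_limit_problems("3Gi", 4 * 2**30, 0.95))   # both problems reported
print(heap_limit_problems("3Gi", 2684354560, 0.95))  # []
```

Run against every perimeter deployment in CI, a check like this would flag the mismatch at review time instead of during a load spike.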

🛡️

Envoy's Custom Filter Architecture

Spotify's custom Envoy filters are the engineering investment that makes this outage particularly instructive. Envoy provides a well-defined filter chain mechanism for extensibility: the filter chain is the ordered sequence of processing modules (filters) that each request passes through in an Envoy instance, and each filter can inspect, modify, or reject the request before passing it to the next one. Developers write filters, for example in C++ or Lua, that plug into this request processing pipeline. The order of filters in the chain determines the sequence of execution. Spotify's filters included rate limiting (discussed at EnvoyCon 2025) and other custom logic. Changing this order is semantically meaningful: a filter that assumes it runs after authentication may behave incorrectly if it suddenly runs before it. The latent bug on April 16 was exactly this class of assumption.
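An order-dependent assumption is easy to reproduce in miniature. The toy chain below is plain Python, not Envoy's filter API, and the filters are invented for illustration: the rate-limit filter silently assumes the auth filter has already run. Enumerating every ordering exposes the failure that a test in the original order would never see.

```python
from itertools import permutations


def auth_filter(request):
    request["user"] = request["headers"].get("x-user", "anonymous")
    return request


def rate_limit_filter(request):
    # Latent bug: assumes auth_filter already ran and set "user".
    request["bucket"] = "rl:" + request["user"]
    return request


def logging_filter(request):
    request.setdefault("log", []).append(request.get("path", ""))
    return request


def run_chain(filters, request):
    """Pass the request through each filter in order."""
    for f in filters:
        request = f(request)
    return request


def failing_orders(filters, make_request):
    """Return every ordering of `filters` that raises on a fresh request."""
    bad = []
    for order in permutations(filters):
        try:
            run_chain(order, make_request())
        except Exception as exc:
            bad.append(([f.__name__ for f in order], type(exc).__name__))
    return bad
```

Here the original order passes, while every ordering that puts `rate_limit_filter` ahead of `auth_filter` raises a `KeyError`. In a native proxy filter the equivalent mistake can crash the process rather than raise an exception. For long chains, testing just the proposed new order is the practical version of the same idea.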


Lessons

The Spotify April 2025 outage is one of the cleanest documented examples of how a reasonable assessment ('low risk') combined with an undetected misconfiguration produces a disproportionate outcome. The lessons here are deeply practical.

  1. 'Low risk' is not a substitute for staged rollout at the perimeter. A change's risk profile determines what validation it needs; it doesn't override the need for validation. Changes to shared perimeter infrastructure that affect all users worldwide deserve incremental rollout regardless of their apparent complexity. The filter reorder was simple; the blast radius of a failure was total. Stage perimeter changes by region and monitor before expanding.
  2. Latent bugs that depend on execution context cannot be caught by tests that don't vary that context. A latent bug is a code defect that exists in production but is harmless until a specific triggering condition occurs, so standard testing can miss it when that condition is rare or contextual. A filter test suite that exercises filters in their original order will never discover a bug that only manifests in a different order. When making ordering or sequencing changes, test explicitly in the new order rather than relying on existing test coverage that implicitly assumes the old order.
  3. Audit resource limit configurations against actual and stress-test peak usage regularly. Mismatches between Envoy heap size and Kubernetes memory limits are invisible until a load event forces memory beyond the limit. The same pattern exists for thread pool sizes, connection pool limits, and file descriptor limits. A misconfiguration that's been harmless for months can become catastrophic under the right load spike.
  4. Client-side retry logic turns total simultaneous failures into traffic amplification events. Automatically retrying failed requests after a brief delay is designed to handle transient failures, but it amplifies load during sustained ones. Design retry logic with awareness of this: exponential backoff with jitter spreads retries over time; circuit breakers prevent retries when the failure rate exceeds a threshold; retry budgets limit total retry volume per client. These mechanisms reduce the kind of retry flood that powered Spotify's death loop.
  5. When one region survives an outage that hits all others, that region is your fastest path to root cause. APAC's survival acted as a natural experiment running in production: its configuration was identical; its traffic was lower. The asymmetry proved the diagnosis. Building the habit of systematically comparing the surviving regions against the failed ones, rather than focusing exclusively on what went wrong, is the investigative discipline that shortens MTTR.
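Two of the retry mechanisms named in lesson 4 fit in a few lines. The sketch below shows one common form of each, "full jitter" exponential backoff and a ratio-based retry budget. The class, function names, and default values are illustrative assumptions, not taken from Spotify's clients.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay in seconds before retry number `attempt` (0-based)."""
    rng = rng or random
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


class RetryBudget:
    """Permit retries only while they stay within `ratio` of first attempts."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_acquire(self) -> bool:
        """Return True and consume budget if one more retry is allowed."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True
```

A client would call `record_request()` for each first attempt and `try_acquire()` before each retry, sleeping for `backoff_delay(attempt)` when a retry is allowed. The jitter keeps clients that failed at the same instant from retrying at the same instant, and the budget caps how much extra load retries can add even if every request fails.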

Spotify's Transparency Standard

The April 16 postmortem was published on May 9 — 23 days after the incident. It named the specific system involved (Envoy Proxy, their custom filters), linked directly to their EnvoyCon 2025 talk about the same system, included exact timestamps for every recovery milestone, and explained the death loop mechanism in precise technical terms. It also enumerated four specific engineering commitments — not aspirational language but concrete actions. This level of technical transparency in a public postmortem is rare and sets a standard. Spotify's engineering culture treats accountability as a tool for improvement, not a liability to be managed.

THE ENVOYCON IRONY

Spotify's engineering team had publicly presented their custom Envoy filter work — including the rate limiting filter — at EnvoyCon 2025, just weeks before the April 16 incident. The presentation described the filter system as a capability Spotify had built to enhance Envoy's performance. The April 16 postmortem described what happened when the execution order of that same filter system was changed without a staged rollout. The two documents together are an accidental case study in the gap between how a system is designed to work and how it fails under an unexpected configuration change. Publishing both — the capability presentation and the failure postmortem — is a model of engineering transparency.

Spotify changed the order of some filters in their proxy, which seemed fine until every server on Earth crashed simultaneously, and then Kubernetes helpfully restarted them all into the same crash in a loop — which is either a distributed systems problem or a distributed systems feature depending on how you look at it.

TechLogStack — built at scale, broken in public, rebuilt by engineers


This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack → (interactive diagrams, source links, and the full reader experience).
