TechLogStack

Posted on May 24 • Originally published at techlogstack.com on May 24

Spotify Changed a Filter Order in Their Proxy — Then Every Server in the World Crashed at Once

#devops #kubernetes #reliability #webdev

3h 27m outage — 12:18 UTC to 15:45 UTC, April 16 2025
675M monthly active users affected globally
48,000+ peak Downdetector reports
0 regions with staged rollout — applied globally simultaneously
Trigger: latent bug in a custom Envoy filter, exposed by the reorder
Amplifier: Envoy max heap configured higher than K8s memory limit
Fix: capacity increase reduced per-instance memory below the kill threshold

On April 16, 2025, Spotify's engineering team made a change they deemed low risk: reordering the custom filters in their Envoy Proxy perimeter. (Envoy is an open-source edge proxy that receives all incoming user traffic before distributing it to backend services.) They applied it to all regions simultaneously. Every Envoy instance worldwide crashed at once, and alarms fired two minutes later. Then the restart loop began, powered by Kubernetes itself, killing each new instance as fast as it came back up. Asia Pacific kept serving traffic, and the reason why told the engineers exactly what was broken.

The Story

This crash happened simultaneously on all Envoy instances.

— Spotify Engineering, Incident Report: Spotify Outage on April 16, 2025

There is a specific kind of engineering failure that hurts more than the others: the change that was reviewed, discussed, and approved, the one the team looked at together and agreed was fine. Spotify's perimeter is the first layer of software that receives traffic from every user worldwide: every stream request, every search, every login. To extend Envoy's capabilities, Spotify develops its own custom filters, plugins that handle rate limiting, authentication, and other cross-cutting concerns. These filters execute in a defined order. The April 16 change altered that order. The new sequence triggered a latent bug in one of the custom filters: a code path that had sat harmlessly in production because it ran only when the filter executed at a position it had never occupied before. Envoy crashed. Not one instance, not one region. All of them.
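To make the order dependence concrete, here is a toy sketch in Python. It is not Envoy's filter API and not Spotify's code: the filter names (`auth_filter`, `rate_limit_filter`, `logging_filter`), the `user_id` field, and the request shape are all hypothetical. It only illustrates how a filter can carry a hidden assumption about what ran before it, and how trying every ordering surfaces that assumption.

```python
from itertools import permutations

def auth_filter(request):
    # Hypothetical filter: attaches a user id to the request.
    request["user_id"] = "u-123"
    return request

def rate_limit_filter(request):
    # Latent bug: assumes auth_filter has already set "user_id".
    # In the original order that always holds, so the bug never fires.
    request["rate_limit_key"] = request["user_id"]
    return request

def logging_filter(request):
    request.setdefault("log", []).append("seen")
    return request

ORIGINAL_ORDER = [auth_filter, rate_limit_filter, logging_filter]

def run_chain(filters, request):
    """Pass the request through each filter in order."""
    for f in filters:
        request = f(request)
    return request

def failing_orders(filters):
    """Return every ordering of `filters` that raises on a fresh request."""
    failures = []
    for order in permutations(filters):
        try:
            run_chain(order, {"path": "/v1/play"})
        except Exception as exc:
            failures.append(([f.__name__ for f in order], type(exc).__name__))
    return failures
```

A test suite that only ever runs `ORIGINAL_ORDER` passes. `failing_orders(ORIGINAL_ORDER)` reports the orderings in which `rate_limit_filter` runs before `auth_filter`, which is the kind of check that has to exist before a reorder ships.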

The Death Loop: Why the Restart Made Things Worse

An Envoy crash is normally survivable — Kubernetes detects the failed pod and starts a replacement. But client-side retry logic (every user's app retrying its failed request) created an unprecedented traffic spike onto each new instance. Each new Envoy started, received the full flood of retry traffic, consumed more memory than the Kubernetes memory limit (the maximum memory a pod is allowed to use — when exceeded, K8s automatically terminates it), and was killed. A new instance started. The same thing happened. The loop repeated — powered by Kubernetes itself — for hours.

Problem

12:18 UTC — Filter Reorder Applied Globally, All Envoy Instances Crash

The change to Envoy filter execution order was applied simultaneously to all cloud regions worldwide. The new order activated a latent bug in a custom Spotify filter. Every Envoy instance on Spotify's networking perimeter crashed at the same moment. Alarms fired two minutes later as the traffic drop became measurable.

Cause

The Hidden Misconfiguration: Heap Larger Than the K8s Memory Limit

The traffic flood from client retries exposed a misconfiguration that had existed undetected: Envoy's max heap size was configured higher than the Kubernetes memory limit for the pod. Under normal traffic, Envoy never approached its heap limit and the misconfiguration was invisible. Under the retry flood, each new instance immediately exceeded the K8s limit and was killed. This turned a recoverable crash into an infinite restart loop.

Solution

Asia Pacific Stayed Up — and Explained Everything

Asia Pacific was the only region that kept serving traffic. Engineers investigated why. The answer: lower traffic volume at that time of day (timezone difference) meant restarted APAC Envoy instances never received enough retry traffic to exceed the K8s memory limit. The asymmetry proved the hypothesis: the death loop was driven by the memory limit, not by the filter bug. Fix the memory headroom, break the loop.

Result

15:45 UTC — Death Loop Broken, Full Recovery

Increasing total perimeter server capacity gave each new Envoy instance enough headroom to stay under the K8s memory limit even while absorbing the retry traffic flood. The death loop broke. EU recovered at 14:20 UTC and US at 15:10 UTC, and traffic in all regions looked normal by 15:40 UTC. Counted from the 12:18 UTC change to the 15:45 UTC end of the outage window, total duration was 3 hours 27 minutes.

The Fix

The Misconfiguration Nobody Noticed — Until the Crash

The root problem was that Envoy's max heap size was set higher than the Kubernetes memory limit for the pod. In normal operation, Envoy memory usage never approached its heap maximum — the misconfiguration was invisible. The retry flood was the first event extreme enough to push instances over the K8s limit and trigger the kill cycle.

3h 27m — Total outage duration, 12:18 to 15:45 UTC
675M — Users affected; 263M paying Premium subscribers — no perimeter differentiation by tier
48,000+ — Peak Downdetector reports (active reporters only; actual affected users in the hundreds of millions)
0 — Regions with staged rollout before full deployment

# THE MISCONFIGURATION: Envoy heap limit higher than K8s memory limit

# Kubernetes pod resource specification (simplified)
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: envoy
    resources:
      requests:
        memory: "2Gi"
      limits:
        memory: "3Gi"  # K8s will OOMKill the pod above this

# Envoy overload manager configuration (simplified)
overload_manager:
  resource_monitors:
  - name: envoy.resource_monitors.fixed_heap
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
      max_heap_size_bytes: 4294967296  # 4GiB: HIGHER than the K8s 3GiB limit!

# Why this is catastrophic:
# - K8s kills at 3GiB memory usage
# - Envoy's own safety valve (e.g. an overload action set at 95%) triggers at 0.95 x 4GiB = 3.8GiB
# - K8s limit is hit BEFORE Envoy's graceful degradation kicks in
# - Under normal load: Envoy peaks at ~1.5GiB, so the misconfiguration is invisible
# - Under retry flood: Envoy climbs past 3GiB → OOMKill → restart → repeat

# IMMEDIATE FIX: Increase perimeter server count
# More servers = retry traffic spread across more instances
# = each instance stays under 3GiB = K8s doesn't kill = loop breaks

# PERMANENT FIX: Align heap config with K8s memory limit
# max_heap_size_bytes: 2684354560  # 2.5GiB, safely below the K8s 3GiB limit

Why Increasing Capacity Fixed the Loop

The K8s memory limit was fixed. The retry traffic load was fixed (determined by user behaviour). The only variable Spotify could change quickly was the number of Envoy instances sharing that retry load. More instances → each instance receives a smaller share of the flood → memory stays below the K8s limit → K8s doesn't kill it → stable. The underlying misconfiguration (heap > K8s limit) was fixed separately afterward as permanent remediation.
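The arithmetic behind "more instances, less memory each" can be sketched with a deliberately simple linear model. Everything numeric here is an assumption for illustration: the 1 GiB idle footprint, the 1 MiB of memory per request-per-second, and the 400,000 rps retry flood are made up, and the 3 GiB limit comes from the simplified config above rather than from Spotify's real settings. Real proxy memory does not grow this linearly.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2

def per_instance_memory(base_bytes, bytes_per_rps, total_rps, instances):
    """Toy model: idle footprint plus memory proportional to this instance's share of traffic."""
    return base_bytes + bytes_per_rps * (total_rps / instances)

def min_instances(base_bytes, bytes_per_rps, total_rps, limit_bytes):
    """Smallest instance count that keeps every instance strictly under limit_bytes."""
    headroom = limit_bytes - base_bytes
    if headroom <= 0:
        raise ValueError("limit is at or below the idle footprint; no instance count helps")
    # Need base + k * R / n < limit, i.e. n > k * R / headroom.
    return math.floor(bytes_per_rps * total_rps / headroom) + 1

# Hypothetical inputs (assumptions, not figures from the incident).
BASE = 1 * GIB
PER_RPS = 1 * MIB
FLOOD_RPS = 400_000
LIMIT = 3 * GIB
```

With these made-up inputs, 100 instances would each need roughly 4.9 GiB and be killed on every restart, while `min_instances(BASE, PER_RPS, FLOOD_RPS, LIMIT)` says 196 instances is the first count at which each one stays under the limit. The point is the shape of the relationship: the limit and the flood are fixed, so the instance count is the only term left to move.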

Spotify's four post-incident commitments:

Fix the filter bug that caused the initial crash on filter reorder
Fix the heap/K8s limit mismatch — align Envoy config with pod resource limits
Staged perimeter rollouts — regional validation before global deployment
Improved monitoring — detect configuration issues earlier in the failure chain
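The staged-rollout commitment can be sketched as a small control loop. This is a minimal illustration, not Spotify's deployment tooling: `apply_change`, `is_healthy`, and `rollback` are hypothetical callbacks you would supply, and a real pipeline would also wait for a bake period and watch metrics before calling a region healthy.

```python
def staged_rollout(regions, apply_change, is_healthy, rollback):
    """Apply a change one region at a time.

    Stops at the first unhealthy region and rolls back every region
    touched so far (most recent first).
    Returns (succeeded, regions_touched).
    """
    applied = []
    for region in regions:
        apply_change(region)
        applied.append(region)
        if not is_healthy(region):
            for done in reversed(applied):
                rollback(done)
            return False, applied
    return True, applied
```

If the change crashes every instance it touches, a call such as `staged_rollout(["apac", "eu", "us"], ...)` stops after the first region, so the blast radius is one region instead of all of them. Which region goes first is a design choice; starting with the lowest-traffic one is a common option.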

Incident timeline:

Time (UTC)	Event	Status
12:18	Filter reorder applied; all Envoy instances crash	🔴 Global failure
12:20	Alarms fire on traffic drop; death loop running	🔴 Engineers paged
12:28	Escalated; only APAC serving traffic	🔴 Incident declared
~13:xx	Root cause identified via APAC asymmetry	🟡 Diagnosis complete
14:20	EU fully recovered	🟡 Partial recovery
15:10	US fully recovered	🟡 Partial recovery
15:40	All regions normalised	🟢 Full recovery

Architecture

Spotify's networking perimeter places Envoy Proxy as the outermost layer — the first software that receives every user request, regardless of what backend it is destined for. When every Envoy instance crashes simultaneously, no user request can reach any backend service. The entire platform goes dark regardless of whether individual backend services remain healthy. This is the shared fate property of perimeter architecture: a perimeter failure has a blast radius of every service, every user, every region simultaneously.

Spotify's Perimeter Architecture: Envoy as the Universal Traffic Gateway

Interactive diagram available on TechLogStack.

The Three-Layer Failure Cascade: From Filter Bug to Death Loop

Interactive diagram available on TechLogStack.

The APAC Diagnostic: How One Region Proved the Root Cause

When engineers observed APAC was unaffected, they had two candidate hypotheses: (A) the filter bug is region-specific, or (B) the death loop is traffic-intensity dependent. Investigation confirmed (B): APAC runs identical filter configuration — lower traffic meant less retry amplification, meaning per-instance memory pressure never reached the K8s limit. This asymmetry transformed a hard debugging problem ("why is the loop happening?") into a tractable one ("what's different about APAC?") and pointed directly at the memory-limit misconfiguration.

Configuration drift: why this existed undetected for months

The Envoy heap/K8s limit misconfiguration almost certainly existed long before April 16. It was never caught because Envoy memory usage never reached the dangerous threshold under normal traffic. This is a common pattern: configuration mismatches that are only dangerous under abnormal load go undetected indefinitely in systems where abnormal load doesn't occur. The misconfiguration didn't cause the outage — the filter bug did. But it was what turned a recoverable crash into a multi-hour global outage. Auditing resource limit configurations against actual peak usage, including synthetic stress tests, is the practice that catches these before they detonate.
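A static audit for this specific mismatch is small enough to sketch. The function names and the 0.85 safety margin below are assumptions chosen for illustration, and `parse_quantity` handles only a subset of Kubernetes quantity syntax (plain integers and the common binary and decimal suffixes). The inputs are the two values from the simplified config: Envoy's `max_heap_size_bytes` and the pod's memory limit string.

```python
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4,
    "k": 1000, "M": 1000 ** 2, "G": 1000 ** 3, "T": 1000 ** 4,
}

def parse_quantity(quantity):
    """Parse a subset of Kubernetes memory quantities ("3Gi", "512Mi", "1000000") into bytes."""
    # Two-character binary suffixes are checked before one-character decimal ones.
    for suffix in ("Ki", "Mi", "Gi", "Ti", "k", "M", "G", "T"):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)

def audit_heap_vs_limit(max_heap_size_bytes, k8s_memory_limit, safety_margin=0.85):
    """Classify a heap ceiling against a pod memory limit.

    safety_margin is an assumed policy: the heap ceiling should sit at or
    below this fraction of the pod limit to leave room for non-heap memory.
    """
    limit = parse_quantity(k8s_memory_limit)
    budget = int(limit * safety_margin)
    if max_heap_size_bytes > limit:
        return "FAIL: heap ceiling exceeds the pod memory limit"
    if max_heap_size_bytes > budget:
        return "WARN: heap ceiling is within the limit but above the safety margin"
    return "OK"
```

Run against the example values, `audit_heap_vs_limit(4294967296, "3Gi")` fails and `audit_heap_vs_limit(2684354560, "3Gi")` passes. A check like this catches the static mismatch; it does not replace stress testing, which is what shows how close real memory usage gets to the limit.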

Lessons

'Low risk' is not a substitute for staged rollout at the perimeter. A change's risk profile determines what validation it needs — it doesn't override the need for validation. The filter reorder was simple; the blast radius of failure was total. Stage perimeter changes by region and monitor before expanding.
Latent bugs (code defects harmless until a specific triggering condition occurs) that depend on execution context cannot be caught by tests that don't vary that context. A filter test suite that exercises filters in their original order will never discover a bug that only manifests in a different order. When making ordering or sequencing changes, test explicitly in the new order.
Audit resource limit configurations against actual and stress-test peak usage regularly. Mismatches between Envoy heap size and Kubernetes memory limits are invisible until a load event forces memory beyond the limit. A misconfiguration harmless for months can become catastrophic under the right load spike.
Client-side retry logic turns total simultaneous failures into traffic amplification events. Design retry logic with awareness of this: exponential backoff with jitter spreads retries over time; circuit breakers prevent retries when failure rate exceeds a threshold; retry budgets limit total retry volume per client.
When one region survives an outage that hits all others, that region is your fastest path to root cause. APAC's survival was a controlled experiment running in production. Its configuration was identical; its traffic was lower. The asymmetry proved the diagnosis. Systematically compare surviving regions against failed ones — it shortens MTTR.
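Two of the retry techniques named in the lessons above, exponential backoff with jitter and a retry budget, fit in a short sketch. This is an illustration under assumed parameters, not the behaviour of Spotify's clients: the base delay, cap, and budget ratio are arbitrary, and `backoff_delays` and `RetryBudget` are names invented here.

```python
import random

def backoff_delays(max_retries, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter exponential backoff.

    Delay for attempt n is rng() * min(cap, base * 2**n), so clients that
    failed at the same instant spread their retries out over time.
    """
    return [rng() * min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]

class RetryBudget:
    """Permit retries only while they stay within `ratio` of first-attempt requests."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count one first-attempt request."""
        self.requests += 1

    def can_retry(self):
        """Spend one unit of budget if available; return whether the retry is allowed."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True
```

A client would call `record_request()` for each original request, ask `can_retry()` before each retry, and sleep for the next value from `backoff_delays(...)` when the answer is yes. The jitter decorrelates clients; the budget caps how much extra load retries can add when everything is failing at once.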

Engineering Glossary

Client-side retry logic — application behaviour where the client automatically retries failed requests after a brief delay. Designed to handle transient failures, but capable of amplifying load during sustained simultaneous failures by converting each failed request into one or more retry requests.

Death loop — an informal term for an infinite restart cycle where a pod crashes, Kubernetes restarts it, and the replacement crashes for the same reason. Powered by K8s restart behaviour combined with a condition (here: retry flood + heap misconfiguration) that guarantees each replacement fails.

Envoy Proxy — an open-source, high-performance edge proxy originally built at Lyft, widely used as the networking perimeter layer in distributed systems. Receives all incoming user traffic before distributing it to backend services.

Filter chain — the ordered sequence of processing modules (filters) that each request passes through in an Envoy proxy instance. Each filter can inspect, modify, or reject the request before passing it to the next filter. Order is semantically meaningful.

Latent bug — a code defect that exists in production but is harmless until a specific triggering condition occurs. Undetectable by standard testing if the triggering condition is rare or contextual.

OOMKill — Out-Of-Memory Kill. The Kubernetes mechanism that terminates a pod when it exceeds its configured memory limit, to protect other workloads on the node from memory starvation.

Shared fate system — an architecture where all dependent services rise and fall with a shared component. Spotify's Envoy perimeter is a shared fate system: if it fails, every backend service becomes unreachable regardless of whether those services are healthy.

Staged rollout — deploying a change to a subset of infrastructure (one region, one cluster) and validating behaviour before expanding to the full fleet. The safety mechanism absent from the April 16 deployment.

This case is a plain-English retelling of publicly available engineering material.

