Pranav Bhasker

Postmortem: Eliminating OOM Failures in Spark on Kubernetes (Azure) After Cloud Migration

Originally published on Medium (Towards Data Engineering)

A production case study on debugging persistent memory failures in large-scale Spark batch workloads

Abstract

OutOfMemory (OOM) failures in Apache Spark are rarely caused by a single misconfiguration. In production environments — especially when Spark runs on Kubernetes in the cloud — memory failures often emerge from the interaction between scheduling, node sizing, shuffle behavior, and storage configuration.

This post documents a real production incident that occurred after migrating Spark batch workloads from on-premises to Azure Kubernetes Service (AKS). Jobs processing large datasets began experiencing recurring executor OOM kills, while smaller jobs ran fine. The failures were eliminated by adding pod anti-affinity rules and reconfiguring Spark’s temporary file (shuffle spill) settings from memory-backed ‘tmpfs’ to disk-backed ephemeral storage. The result was six months of stable pipelines with zero OOM incidents and significantly improved throughput using the same cluster footprint.

The Problem That Wouldn’t Go Away

Our Spark batch pipelines had been running reliably for months on-premises, processing terabytes of event data daily. After migrating the same workloads to Azure Kubernetes Service (AKS), some jobs that processed large datasets began failing with:

java.lang.OutOfMemoryError: Java heap space

Notably, smaller jobs continued to run without issues. The failures were intermittent at first but became frequent during shuffle-heavy stages in data-intensive jobs. Restarts temporarily masked the issue, but OOMs returned consistently under load for these workloads.

The root cause was not a code regression — the same jobs ran fine on-premises. The issue was that Azure-side configurations had been accidentally set differently during migration, fundamentally changing how Kubernetes managed memory and storage for shuffle and temporary data.

Root Cause Analysis

We identified three compounding factors, with the critical issue being the combination of the last two:

Shuffle-driven memory pressure in large data jobs
Large shuffle stages in data-intensive jobs triggered sharp executor memory spikes.

Missing pod anti-affinity configuration
Without explicit affinity rules, Kubernetes scheduled all executors onto a single node, amplifying memory pressure.

tmpfs=true for shuffle spill (the primary culprit)
Shuffle spill paths were configured with emptyDir.medium: Memory, causing Kubernetes to use memory-backed storage for shuffle and temporary data instead of disk.

The combination of missing affinity settings and tmpfs=true was catastrophic. All executors landed on the same node, and their shuffle spill data consumed node memory instead of disk. This worked fine on-premises where these configurations were different, but on Azure it created a perfect storm for OOM failures in jobs processing large datasets.

Kubernetes Scheduling: Pod Anti-Affinity

The Problem

AKS was placing all executor pods on the same node. During shuffle-heavy stages, stacking all executors on a single node saturated memory and I/O, triggering kernel OOM kills.

The Fix

We introduced pod anti-affinity to spread executors across nodes:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: spark-role
                operator: In
                values:
                  - executor
          topologyKey: kubernetes.io/hostname

Result: Executor placement became more even, and node-level memory spikes dropped significantly.
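On Spark 3.x, one common way to attach an affinity block like this to executor pods is a pod template file passed via spark.kubernetes.executor.podTemplateFile (the file path below is illustrative; the affinity content mirrors the block above):

```yaml
# executor-pod-template.yaml — Spark merges this spec into every executor pod it creates
apiVersion: v1
kind: Pod
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: spark-role
                  operator: In
                  values: ["executor"]
            topologyKey: kubernetes.io/hostname
```

Then reference it in the Spark configuration, e.g. spark.kubernetes.executor.podTemplateFile=/opt/spark/conf/executor-pod-template.yaml. Using preferred (rather than required) anti-affinity keeps the scheduler from deadlocking when there are more executors than nodes.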

Right-Sizing Node Capacity (Azure)

While nodes were provisioned with 64GB RAM, OS and kubelet overhead reduced the usable memory available to workloads. When all executors landed on the same node, memory pressure exceeded safe thresholds.

Fixes implemented:

  1. Ensured executor requests + limits left sufficient headroom for OS + kubelet
  2. Enforced anti-affinity to prevent executor stacking.
  3. Validated node memory availability under peak shuffle workloads.

This stabilized memory availability at the node level.
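As a sketch of the headroom math (the numbers below are illustrative, not our exact values — actual allocatable memory varies by AKS node SKU and kube-reserved settings, and is visible via kubectl describe node): on a 64GB node with roughly 56–58GB allocatable, two executors sized as follows leave several gigabytes of slack for the OS, kubelet, and daemonsets:

```properties
# Illustrative executor sizing for a 64 GB AKS node (~56-58 GB allocatable).
# Pod memory request = executor.memory + executor.memoryOverhead = 20 GB,
# so two executors per node request 40 GB, leaving ample node headroom.
spark.executor.memory=16g
spark.executor.memoryOverhead=4g   # off-heap, JVM metaspace, network buffers
spark.executor.cores=4
```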

tmpfs and Shuffle Spill: Root Cause and Fix

During migration, shuffle spill paths were configured with memory-backed emptyDir volumes (tmpfs=true):

# BEFORE (problematic): tmpfs=true causes memory-backed storage
volumes:
  - name: spark-local-dir
    emptyDir:
      medium: Memory   # tmpfs=true (RAM-backed)

spark.local.dir=/tmp/spark-local

This configuration caused Kubernetes to use memory-backed storage for all shuffle and temporary data. Instead of spilling shuffle data to disk during large shuffles, Spark wrote everything to node memory (tmpfs). When combined with the missing pod anti-affinity settings that allowed all executors to co-locate on the same node, this dramatically increased OOM risk — especially for jobs processing large datasets.

On-premises, this configuration was different, which is why the same jobs ran fine there.

The Final Fix

We changed emptyDir to use ephemeral disk-backed storage by removing the medium: Memory specification:

# AFTER (fixed): tmpfs=false enables disk-backed ephemeral storage
volumes:
  - name: spark-local-dir
    emptyDir: {}   # tmpfs=false (node ephemeral disk)

This single change removed memory-backed spill pressure entirely. Kubernetes now used disk for shuffle and temporary data instead of consuming node memory. Combined with pod anti-affinity, this was the most impactful fix in stabilizing executor memory behavior for large data jobs.
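Spark on Kubernetes also exposes this switch directly: the spark.kubernetes.local.dirs.tmpfs property controls whether the emptyDir volumes backing Spark's local directories use medium: Memory. Keeping it at its default achieves the same disk-backed behavior without hand-editing volume specs:

```properties
# When true, Spark mounts its local (shuffle spill) dirs as RAM-backed tmpfs
# emptyDir volumes; false — the default — uses node ephemeral disk instead.
spark.kubernetes.local.dirs.tmpfs=false
```

Auditing this one property during migration would have caught the misconfiguration before it reached production.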

Spark Shuffle Tuning

We reduced per-task shuffle memory pressure by tuning partitions:

spark.sql.shuffle.partitions=200

This balanced parallelism and memory usage for our workload scale without introducing excessive scheduling overhead.
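On Spark 3.x, adaptive query execution can complement a static partition count by coalescing small shuffle partitions at runtime (a sketch, assuming AQE fits the workload; the static value then acts as the initial partitioning):

```properties
# AQE merges undersized shuffle partitions after each stage, reducing
# the sensitivity of jobs to a single hand-tuned partition count.
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.shuffle.partitions=200
```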

Monitoring and Validation

We instrumented production metrics for:

  • Executor heap usage
  • Shuffle read/write volume
  • Pod eviction rates
  • Node memory utilization
  • Job failure frequency

Alerts added:

  1. Executor OOM kill events
  2. Node memory utilization thresholds
  3. Job retry rate increases

Spark UI, AKS metrics, and Kubernetes dashboards were used to validate improvements under peak load.
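For ad-hoc validation, OOM kills can be surfaced directly from the Kubernetes API (assuming the standard spark-role=executor label on executor pods):

```shell
# Show the last termination reason for each executor pod — OOMKilled
# indicates the container hit its memory limit or the kernel OOM killer.
kubectl get pods -l spark-role=executor \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'

# Scan recent cluster events for node-level and container-level OOM activity
kubectl get events --all-namespaces | grep -i oom
```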

Results

Before

  • Frequent executor OOM kills
  • Job failures during shuffle-heavy stages
  • Unpredictable runtimes
  • Increased on-call incidents

After

  • Zero OOM incidents over six months
  • Stable executor memory behavior
  • Predictable batch completion times
  • Improved pipeline reliability without increasing cluster size

Lessons Learned

  • Cloud migrations can silently change storage and memory semantics — configurations that worked on-premises may fail in the cloud
  • tmpfs=true causes Kubernetes to use memory-backed storage for shuffle spill, which can be catastrophic for memory-intensive workloads processing large datasets
  • Missing pod anti-affinity rules allow executor co-location, which amplifies memory pressure when combined with memory-backed storage
  • The combination of tmpfs=true and missing affinity is especially dangerous — each problem compounds the other
  • Spark OOM issues are often infrastructure configuration problems in disguise, not application code issues

Production Checklist

  • Enforce pod anti-affinity for executors
  • Validate Azure node memory headroom
  • Avoid tmpfs (emptyDir.medium: Memory) for shuffle spill in large jobs
  • Use disk-backed ephemeral storage for shuffle spill
  • Tune shuffle partitions for workload scale
  • Monitor node-level memory, not just executor heap
  • Validate config parity during cloud migration

Conclusion

This incident revealed how two infrastructure-level misconfigurations introduced during cloud migration can combine to destabilize otherwise healthy Spark workloads. Setting tmpfs=true (emptyDir.medium: Memory) caused Kubernetes to use memory-backed storage for shuffle and temporary data, while missing pod anti-affinity rules allowed all executors to co-locate on the same node. Together, these configurations created catastrophic memory pressure for jobs processing large datasets—workloads that ran perfectly fine on-premises with different configurations.

By reverting to tmpfs=false (disk-backed ephemeral storage) and adding explicit pod anti-affinity, we eliminated persistent OOM failures and restored production stability for all workloads, regardless of data size.

Let's discuss

Have you experienced similar OOM issues with Spark on Kubernetes?

I'd love to hear about:

  • Your debugging approaches for memory failures
  • Cloud migration gotchas you've encountered
  • Alternative solutions you've implemented

Drop your experiences in the comments below.
