In July 2025, Pinterest unveiled its next-generation data processing platform, Moka, designed to run Apache Spark on Kubernetes using Amazon EKS. It’s a major leap away from the legacy Hadoop + YARN setup—and a great case study in modern, scalable data engineering.
Here’s what’s under the hood and why it matters.
🧱 Moka Architecture Overview
Pinterest’s new stack brings together:
- Spark on Kubernetes (EKS): Jobs run in isolated containers, optimized for ARM (Graviton) and x86.
- Archer: Pinterest’s internal job submission system that transforms Spark jobs into Kubernetes custom resources (CRDs).
- Apache YuniKorn: A batch scheduler for fairness, preemption, and workload prioritization.
- Amazon S3 + Celeborn: Storage and shuffle services optimized for remote/distributed use.
- Moka UI: A read-only dashboard for job status, logs, and performance metrics.
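To make the S3 + Celeborn piece concrete, here is a minimal sketch of the Spark properties a Moka-style job might set to read and write data on S3 while offloading shuffle to a remote Celeborn cluster. The property names follow the open-source Apache Celeborn Spark client and the Hadoop S3A connector; the endpoint hosts, bucket name, and helper function are placeholders, not Pinterest’s actual values.

```python
# Sketch: Spark properties for remote shuffle (Celeborn) + S3 storage.
# Values below are illustrative placeholders, not Pinterest's config.

def celeborn_s3_spark_conf(celeborn_masters, s3_bucket):
    """Return Spark properties enabling Celeborn remote shuffle and S3 I/O."""
    return {
        # Route shuffle data through Celeborn instead of executor-local disks,
        # so executors stay stateless and nodes can be scaled down safely.
        "spark.shuffle.manager": "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
        "spark.celeborn.master.endpoints": ",".join(celeborn_masters),
        # S3A connector for reading/writing job data directly in S3.
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.sql.warehouse.dir": f"s3a://{s3_bucket}/warehouse",
    }

conf = celeborn_s3_spark_conf(["celeborn-master-0:9097"], "example-bucket")
```

The design point is that both state stores (data in S3, shuffle in Celeborn) live outside the compute nodes, which is what lets Kubernetes autoscale the cluster without losing intermediate results.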
⚙️ How Spark Jobs Flow in Moka
- Workflow Definition: Jobs begin in Spinner, Pinterest’s Airflow-based workflow orchestration system.
- Job Submission: Spinner sends job metadata to Archer.
- CRD Generation: Archer turns each job into a SparkApplication CRD.
- Scheduling: YuniKorn places the job on available Kubernetes nodes.
- Execution: Spark pods run the job, with data stored in S3 and shuffling handled by Celeborn.
- Observability: Developers monitor everything via Spark UI, Moka UI, and Statsboard dashboards.
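The CRD-generation step above can be sketched in a few lines. This is a hypothetical illustration, not Archer’s actual code: the manifest shape follows the open-source Spark Operator’s `SparkApplication` resource (`sparkoperator.k8s.io/v1beta2`), with `batchScheduler` pointing placement at YuniKorn. Names like `moka-jobs` and the function itself are invented for the example.

```python
# Hypothetical sketch of the Archer step: turning job metadata from Spinner
# into a SparkApplication custom resource. Field layout follows the
# open-source Spark Operator CRD; Pinterest's internal service may differ.

def to_spark_application(job_name, image, main_file, executors=4):
    """Build a SparkApplication manifest as a plain dict."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": job_name, "namespace": "moka-jobs"},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            # Hand pod placement to YuniKorn for queueing/fairness/preemption.
            "batchScheduler": "yunikorn",
            "driver": {"cores": 1, "memory": "2g"},
            "executor": {"cores": 2, "memory": "4g", "instances": executors},
        },
    }

crd = to_spark_application("daily-agg", "example.com/spark:3.5", "s3a://bucket/job.py")
```

In practice a manifest like this would be applied to the cluster (e.g. via a Kubernetes client), after which the flow continues exactly as listed: YuniKorn schedules the pods, Spark executes, and the UIs surface status and metrics.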
🎯 Key Benefits
- Multi-tenancy: Team workloads are isolated, yet centrally managed.
- Resource Efficiency: ARM-based compute and Kubernetes autoscaling save cost.
- Developer Velocity: Devs write declarative Spark jobs; infra handles the rest.
- Scalability: Jobs scale automatically across dynamic infrastructure.
- Monitoring: Central UI shows logs, job history, and performance data.
💡 Why It’s a Big Deal
Pinterest isn’t just moving to containers—they’re redefining what a data platform can look like in the cloud-native era.
The result? More reliable Spark workloads, better cost efficiency, and faster development—without sacrificing observability or control.
👋 Over to You
Have you tried Spark on Kubernetes? What tools are you using for scheduling and shuffle performance? Would love to hear how others are solving the same scale problems.
Let’s chat in the comments 👇