🔥 Day 2: Understanding Spark Architecture - How Spark Executes Your Code Internally

Welcome back to Day 2 of the 60-Day Spark Mastery Series.

Today, we dive into the core of Spark's execution engine - an essential concept for every Data Engineer who wants to write efficient and scalable ETL pipelines.

Let's break down Spark architecture in a way that is simple, visual, and interview-friendly.

🧠 Why Learn Spark Architecture?

If you understand how Spark works internally, you can:

  • Write faster pipelines
  • Debug errors quickly
  • Reduce shuffle
  • Tune cluster performance

โš™๏ธ Spark Architecture (High-Level)

Spark has 3 major components:

  1. Driver Program: This is the "brain" of your Spark application.

The driver:

  • Creates SparkSession
  • Builds logical plan (DAG)
  • Converts transformations into stages/tasks
  • Manages metadata
  • Talks to cluster manager

If the driver crashes → the entire application stops.

This is why we never use collect() on huge datasets - it pulls every row into the driver's memory and can overload or crash the driver.
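As a quick illustration (a minimal sketch with a hypothetical df DataFrame), prefer bounded actions when you only need a sample instead of pulling everything back to the driver:

# collect() pulls EVERY row into the driver's memory - risky on big data
# rows = df.collect()

# Safer alternatives that return only a bounded amount of data to the driver
sample_rows = df.take(10)   # at most 10 rows come back to the driver
df.show(10)                 # prints 10 rows without collecting the full dataset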

  2. Executors: These are worker processes distributed across the cluster.

Executors:

  • Execute tasks in parallel
  • Store data in memory (RDD/DataFrame cache; see the caching sketch after this list)
  • Write shuffle data
  • Report progress back to the driver
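
To illustrate the in-memory caching point above, here is a minimal sketch (again with a hypothetical df): cache() only marks the DataFrame, and the first action actually materializes it in executor memory.

df.cache()                              # lazy: just marks df for caching
df.count()                              # first action materializes the cache on the executors
df.filter(df.amount > 1000).count()     # later jobs read from executor memory, not the source files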

Executors die when your Spark application ends.

If you allocate:
4 executors
4 cores per executor
→ You get 16 parallel task slots.
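
As a rough sketch of how those numbers map to configuration (the property names are standard Spark configs, but whether spark.executor.instances is honored depends on your cluster manager, e.g. YARN or Kubernetes):

from pyspark.sql import SparkSession

# 4 executors x 4 cores each = 16 tasks can run at the same time
spark = (
    SparkSession.builder
    .appName("parallelism-demo")                  # illustrative app name
    .config("spark.executor.instances", "4")      # 4 executor processes
    .config("spark.executor.cores", "4")          # 4 task slots per executor
    .config("spark.executor.memory", "8g")        # memory per executor (example value)
    .getOrCreate()
)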

  3. Cluster Manager: This is the system that allocates cluster resources (machines, CPU, memory) to your Spark application.

Spark supports:

Manager       Usage
Standalone    Local clusters
YARN          Hadoop ecosystem
Kubernetes    Cloud-native Spark
Databricks    Managed Spark service

Cluster manager = Resource allocator, not responsible for task scheduling.
Task scheduling is done by the driver.
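
A small sketch of how the cluster manager is chosen (usually via spark-submit --master; shown in code here purely for illustration, with placeholder hosts):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    .master("local[*]")     # run locally on all cores (no external cluster manager)
    # .master("spark://<master-host>:7077")        # Spark standalone cluster
    # .master("yarn")                              # Hadoop YARN
    # .master("k8s://https://<api-server>:443")    # Kubernetes
    .getOrCreate()
)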

๐Ÿ” Spark Execution Process: Simplified

Example code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("day2-demo").getOrCreate()
# header/inferSchema so "amount" is parsed as a number, not a string
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
filtered = df.filter(df.amount > 1000)
result = filtered.groupBy("category").count()
result.show()

Step 1: You write code
Driver receives commands.

Step 2: Build Logical Plan (DAG)
Spark builds a series of transformations.

Step 3: Optimize Plan
The Catalyst optimizer rewrites the query plan for better performance.

Step 4: Convert to Physical Plan (Stages)
Stages break at shuffle boundaries.
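
You can inspect these plans yourself with explain() (a small sketch using the result DataFrame from the example above; Exchange nodes in the physical plan mark shuffle boundaries, i.e. where stages split):

result.explain(True)   # prints the logical plans plus the physical plan
result.explain()       # physical plan only; look for Exchange = shuffle boundary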

Step 5: Assign Tasks
Each stage is split into tasks, the smallest unit of work (one task per partition).

Step 6: Executors Run Tasks
Parallel execution across cluster nodes.

Step 7: Results → Driver
.show() displays the results in your notebook or terminal.

🌉 Understanding Stages & Tasks

🔹 Stage: A group of tasks that can run in parallel without a shuffle.

Example transformations that cause a shuffle:

  • groupBy
  • join
  • reduceByKey

🔹 Task: The unit of execution run by each executor.

If you have 100 partitions → Spark creates 100 tasks.
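
A quick sketch (reusing df and result from the example above) to see the partition-to-task relationship; note that the 200-partition default after a shuffle is configurable, and adaptive query execution may coalesce it:

print(df.rdd.getNumPartitions())       # tasks in the first (read/filter) stage

# After the groupBy shuffle, the next stage uses spark.sql.shuffle.partitions
# partitions (200 by default), so that many tasks are created for it
print(result.rdd.getNumPartitions())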

Common Spark Architecture Mistakes Beginners Make

  • Using .collect() on large datasets
  • Repartitioning unnecessarily
  • Not broadcasting small lookup tables (see the broadcast sketch after this list)
  • Random executor memory allocation
  • Running heavy Python UDFs on large data
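
For the broadcast point above, here is a minimal sketch (the orders and countries DataFrames and file paths are made up for illustration): broadcasting the small table ships a copy to every executor, so the join avoids shuffling the large side.

from pyspark.sql.functions import broadcast

orders = spark.read.parquet("orders/")                       # large fact table (hypothetical path)
countries = spark.read.csv("countries.csv", header=True)     # small lookup table (hypothetical path)

# Broadcast hint: send the small table to every executor instead of shuffling both sides
enriched = orders.join(broadcast(countries), on="country_code", how="left")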

Follow for more such content, and let me know in the comments if I missed anything. Thank you!
