Welcome back to Day 2 of the 60-Day Spark Mastery Series.
Today, we dive into the core of Spark's execution engine - an essential concept for every Data Engineer who wants to write efficient and scalable ETL pipelines.
Let's break down Spark architecture in a way that is simple, visual, and interview-friendly.
🧠 Why Learn Spark Architecture?
If you understand how Spark works internally, you can:
- Write faster pipelines
- Debug errors quickly
- Reduce shuffle
- Tune cluster performance
⚙️ Spark Architecture (High-Level)
Spark has 3 major components:
1. Driver Program: This is the "brain" of your Spark application.
The driver:
- Creates SparkSession
- Builds logical plan (DAG)
- Converts transformations into stages/tasks
- Manages metadata
- Talks to cluster manager
If the driver crashes → the entire application stops.
This is why we never call collect() on huge datasets - it pulls all the data back to the driver and overloads it.
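A minimal sketch of the difference, assuming a hypothetical large sales dataset (the file path and output location are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("sales.csv", header=True, inferSchema=True)  # hypothetical large dataset

# Risky: collect() pulls every row into the driver's memory and can crash it.
# rows = df.collect()

# Safer patterns: inspect only a small sample, or let the executors write the result out in parallel.
df.limit(20).show()
df.write.mode("overwrite").parquet("/tmp/sales_output")  # placeholder output path
```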
2. Executors: These are worker processes distributed across the cluster.
Executors:
- Execute tasks in parallel
- Store data in memory (RDD/DataFrame cache)
- Write shuffle data
- Report progress back to the driver
Executors die when your Spark application ends.
If you allocate:
- 4 executors
- 4 cores per executor
→ You get 16 parallel task slots.
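As a rough sketch, that layout can be requested when building the session. The property names are standard Spark configuration keys, but the values and app name below are placeholders, and spark.executor.instances only applies on managers like YARN/Kubernetes with dynamic allocation disabled:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sales-pipeline")                 # placeholder app name
    .config("spark.executor.instances", "4")   # 4 executors
    .config("spark.executor.cores", "4")       # 4 cores each -> 16 task slots
    .config("spark.executor.memory", "8g")     # memory per executor (example value)
    .getOrCreate()
)
```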
3. Cluster Manager: This is the system that allocates cluster resources (machines, CPU, and memory) to your Spark application.
Spark supports:
| Manager | Usage |
| --- | --- |
| Standalone | Spark's built-in manager for simple clusters |
| YARN | Hadoop ecosystem |
| Kubernetes | Cloud-native Spark |
| Databricks | Managed Spark service |
Cluster manager = Resource allocator, not responsible for task scheduling.
Task scheduling is done by the driver.
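The cluster manager is usually picked through the master URL when the application starts - a minimal sketch, with placeholder host names and ports:

```python
from pyspark.sql import SparkSession

# Standalone cluster manager (placeholder host/port)
spark = SparkSession.builder.master("spark://master-host:7077").getOrCreate()

# YARN (Hadoop ecosystem)
# spark = SparkSession.builder.master("yarn").getOrCreate()

# Kubernetes (placeholder API server address)
# spark = SparkSession.builder.master("k8s://https://k8s-apiserver:6443").getOrCreate()

# Local mode - everything in one JVM, handy for testing (not a real cluster)
# spark = SparkSession.builder.master("local[*]").getOrCreate()
```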
Spark Execution Process: Simplified
Example code:

```python
df = spark.read.csv("sales.csv", header=True, inferSchema=True)  # header/schema options so df.amount resolves
filtered = df.filter(df.amount > 1000)
result = filtered.groupBy("category").count()
result.show()
```
Step 1: You write code
Driver receives commands.
Step 2: Build Logical Plan (DAG)
Spark builds a series of transformations.
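A useful detail here: transformations are lazy, so at this point Spark has only recorded the plan and nothing has executed yet. Continuing the example code above:

```python
# These transformations only extend the DAG - no data is read or processed yet.
filtered = df.filter(df.amount > 1000)
result = filtered.groupBy("category").count()

# Execution starts only when an action runs, e.g. show(), count(), or a write.
result.show()
```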
Step 3: Optimize Plan
The Catalyst optimizer rewrites the query plan for maximum performance.
Step 4: Convert to Physical Plan (Stages)
Stages break at Shuffle boundaries.
Step 5: Assign Tasks
Stages are divided into tasks - the smallest unit of work.
Step 6: Executors Run Tasks
Parallel execution across cluster nodes.
Step 7: Results → Driver
.show() displays results on notebook/terminal.
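To see Steps 2-4 for yourself, you can ask Spark to print the plans it builds. Continuing the example above:

```python
# Logical plan, Catalyst-optimized plan, and physical plan in one dump.
result.explain(True)

# Physical plan only - "Exchange" operators mark the shuffle (stage) boundaries.
result.explain()
```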
๐ Understanding Stages & Tasks
🔹 Stage: A group of tasks that can run in parallel without a shuffle.
Example transformations that cause shuffle:
- groupBy
- join
- reduceByKey
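When one of these wide transformations runs, the number of post-shuffle partitions (and therefore tasks in the next stage) comes from a standard Spark setting - a small sketch, continuing the earlier example:

```python
# Default is 200 shuffle partitions; tune it to your data volume.
spark.conf.set("spark.sql.shuffle.partitions", "64")

# This groupBy now produces 64 shuffle partitions -> 64 tasks in the next stage.
result = filtered.groupBy("category").count()
```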
🔹 Task: The unit of execution run by each executor.
If you have 100 partitions → Spark creates 100 tasks.
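A quick way to check how many tasks a stage will launch is to look at the partition count - continuing the same hypothetical df:

```python
# Number of partitions = number of tasks Spark creates for this stage.
print(df.rdd.getNumPartitions())

# Rough parallelism the cluster offers (about the total number of executor cores).
print(spark.sparkContext.defaultParallelism)
```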
Common Spark Architecture Mistakes by Beginners
- Using .collect() on large datasets
- Repartitioning unnecessarily
- Not broadcasting small lookup tables (see the sketch after this list)
- Random executor memory allocation
- Running heavy Python UDFs on large data
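On the broadcast point: a minimal sketch of broadcasting a small lookup table so the join avoids a full shuffle - the categories file here is a hypothetical example:

```python
from pyspark.sql.functions import broadcast

# Hypothetical small lookup table with category metadata.
categories = spark.read.csv("categories.csv", header=True, inferSchema=True)

# broadcast() ships the small table to every executor,
# so the large df is joined locally instead of being shuffled.
enriched = df.join(broadcast(categories), on="category", how="left")
```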
Follow for more such content. Let me know in the comments if I missed anything. Thank you!!