🔥 Day 2: Understanding Spark Architecture - How Spark Executes Your Code Internally

Welcome back to Day 2 of the 60-Day Spark Mastery Series.

Today, we dive into the core of Spark's execution engine - an essential concept for every Data Engineer who wants to write efficient and scalable ETL pipelines.

Let's break down Spark architecture in a way that is simple, visual, and interview-friendly.

🧠 Why Learn Spark Architecture?

If you understand how Spark works internally, you can:

  • Write faster pipelines
  • Debug errors quickly
  • Reduce shuffle
  • Tune cluster performance

โš™๏ธ Spark Architecture (High-Level)

Spark has 3 major components:

  1. Driver Program: This is the "brain" of your Spark application.

The driver:

  • Creates SparkSession
  • Builds logical plan (DAG)
  • Converts transformations into stages/tasks
  • Manages metadata
  • Talks to cluster manager

If the driver crashes → the entire application stops.

This is why we never use collect() on huge datasets - it pulls every row into the driver's memory and can overload or crash the driver.
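As a quick illustration (a minimal sketch with a hypothetical df DataFrame), prefer bounded actions when you only need a sample instead of pulling everything back to the driver:

# collect() pulls EVERY row into the driver's memory - risky on big data
# rows = df.collect()

# Safer alternatives that return only a bounded amount of data to the driver
sample_rows = df.take(10)   # at most 10 rows come back to the driver
df.show(10)                 # prints 10 rows without collecting the full dataset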

  2. Executors: These are worker processes distributed across the cluster.

Executors:

  • Execute tasks in parallel
  • Store data in memory (RDD/DataFrame cache; see the caching sketch after this list)
  • Write shuffle data
  • Report progress back to the driver
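
To illustrate the in-memory caching point above, here is a minimal sketch (again with a hypothetical df): cache() only marks the DataFrame, and the first action actually materializes it in executor memory.

df.cache()                              # lazy: just marks df for caching
df.count()                              # first action materializes the cache on the executors
df.filter(df.amount > 1000).count()     # later jobs read from executor memory, not the source files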

Executors die when your Spark application ends.

If you allocate:
4 executors
4 cores per executor
→ You get 16 parallel task slots.
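
As a rough sketch of how those numbers map to configuration (the property names are standard Spark configs, but whether spark.executor.instances is honored depends on your cluster manager, e.g. YARN or Kubernetes):

from pyspark.sql import SparkSession

# 4 executors x 4 cores each = 16 tasks can run at the same time
spark = (
    SparkSession.builder
    .appName("parallelism-demo")                  # illustrative app name
    .config("spark.executor.instances", "4")      # 4 executor processes
    .config("spark.executor.cores", "4")          # 4 task slots per executor
    .config("spark.executor.memory", "8g")        # memory per executor (example value)
    .getOrCreate()
)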

  3. Cluster Manager: This is the system that allocates cluster resources (machines, CPU, memory) to your Spark application.

Spark supports:

Manager       Usage
Standalone    Local clusters
YARN          Hadoop ecosystem
Kubernetes    Cloud-native Spark
Databricks    Managed Spark service

Cluster manager = Resource allocator, not responsible for task scheduling.
Task scheduling is done by the driver.
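
A small sketch of how the cluster manager is chosen (usually via spark-submit --master; shown in code here purely for illustration, with placeholder hosts):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    .master("local[*]")     # run locally on all cores (no external cluster manager)
    # .master("spark://<master-host>:7077")        # Spark standalone cluster
    # .master("yarn")                              # Hadoop YARN
    # .master("k8s://https://<api-server>:443")    # Kubernetes
    .getOrCreate()
)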

๐Ÿ” Spark Execution Process: Simplified

Example code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("day2-demo").getOrCreate()
# header/inferSchema so "amount" is parsed as a number, not a string
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
filtered = df.filter(df.amount > 1000)
result = filtered.groupBy("category").count()
result.show()

Step 1: You write code
Driver receives commands.

Step 2: Build Logical Plan (DAG)
Spark builds a series of transformations.

Step 3: Optimize Plan
The Catalyst optimizer rewrites the query plan for better performance.

Step 4: Convert to Physical Plan (Stages)
Stages break at shuffle boundaries.
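
You can inspect these plans yourself with explain() (a small sketch using the result DataFrame from the example above; Exchange nodes in the physical plan mark shuffle boundaries, i.e. where stages split):

result.explain(True)   # prints the logical plans plus the physical plan
result.explain()       # physical plan only; look for Exchange = shuffle boundary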

Step 5: Assign Tasks
Each stage is split into tasks, the smallest unit of work (one task per partition).

Step 6: Executors Run Tasks
Parallel execution across cluster nodes.

Step 7: Results → Driver
.show() displays the results in your notebook or terminal.

🌉 Understanding Stages & Tasks

🔹 Stage: A group of tasks that can run in parallel without a shuffle.

Example transformations that cause a shuffle:

  • groupBy
  • join
  • reduceByKey

🔹 Task: The unit of execution run by each executor.

If you have 100 partitions → Spark creates 100 tasks.
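
A quick sketch (reusing df and result from the example above) to see the partition-to-task relationship; note that the 200-partition default after a shuffle is configurable, and adaptive query execution may coalesce it:

print(df.rdd.getNumPartitions())       # tasks in the first (read/filter) stage

# After the groupBy shuffle, the next stage uses spark.sql.shuffle.partitions
# partitions (200 by default), so that many tasks are created for it
print(result.rdd.getNumPartitions())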

Common Spark Architecture Mistakes Beginners Make

  • Using .collect() on large datasets
  • Repartitioning unnecessarily
  • Not broadcasting small lookup tables (see the broadcast sketch after this list)
  • Random executor memory allocation
  • Running heavy Python UDFs on large data
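
For the broadcast point above, here is a minimal sketch (the orders and countries DataFrames and file paths are made up for illustration): broadcasting the small table ships a copy to every executor, so the join avoids shuffling the large side.

from pyspark.sql.functions import broadcast

orders = spark.read.parquet("orders/")                       # large fact table (hypothetical path)
countries = spark.read.csv("countries.csv", header=True)     # small lookup table (hypothetical path)

# Broadcast hint: send the small table to every executor instead of shuffling both sides
enriched = orders.join(broadcast(countries), on="country_code", how="left")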

Follow for more such content, and let me know in the comments if I missed anything. Thank you!
