Welcome to Day 8 of the Spark Mastery Series.
Today we deep-dive into the topic that makes or breaks ETL pipelines:
Joins and performance optimization.
Master today’s concepts and you can take a Spark job from 2 hours → 10 minutes.
Let’s begin.
🌟 1. Why Are Joins Slow in Spark?
When Spark performs a join, it often must shuffle data across executors so rows with the same join key end up in the same partition.
Shuffle includes:
- Network transfer
- Disk writes
- Disk reads
- Sorting
- Stage creation
In badly optimized pipelines, shuffles can account for the majority of Spark’s execution time.
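You can see these shuffles in the physical plan: every Exchange node is a shuffle boundary. A minimal sketch, using placeholder DataFrame names:
# Each "Exchange hashpartitioning(...)" node in the plan is a full shuffle.
joined = orders.join(customers, "customer_id")   # orders / customers are placeholder DataFrames
joined.explain()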
🌟 2. Broadcast Joins — The Fastest Join Strategy
If one dataset is small enough to fit comfortably in every executor's memory (roughly under 50MB), a broadcast join is usually the fastest option.
from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), "id")
Why it’s fast:
- Spark copies the small table to all executors
- Each executor performs join locally
- No shuffle required
This can turn a shuffle join into a map-side join — extremely fast.
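Spark will also broadcast automatically when it estimates one side to be below spark.sql.autoBroadcastJoinThreshold (10 MB by default); the explicit broadcast() hint simply forces that choice. A minimal sketch, assuming an active SparkSession named spark and an illustrative 50 MB value:
# Raise the auto-broadcast threshold (in bytes); -1 disables automatic broadcasting.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)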
🌟 3. Repartitioning Before Join
If neither DataFrame is already partitioned on the join key, Spark shuffles both sides at join time.
Solution:
df1 = df1.repartition("id")
df2 = df2.repartition("id")
joined = df1.join(df2, "id")
Why this helps:
- Both DataFrames end up hash-partitioned on the join key, so the join stage itself needs no extra shuffle
- It pays off most when the repartitioned (ideally cached) DataFrames are reused across several joins, as in the sketch below; for a single join it mostly just moves the shuffle earlier
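A minimal sketch of the reuse case, assuming df1 and df2 take part in more than one join (the partition count and extra column name are illustrative):
# Pre-shuffle both sides once on the join key, cache, then reuse across joins.
df1 = df1.repartition(200, "id").cache()
df2 = df2.repartition(200, "id").cache()
enriched = df1.join(df2, "id")
audit = df1.join(df2.select("id", "updated_at"), "id")   # "updated_at" is a placeholder column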
🌟 4. Handling Skew — The Most Important Real-World Skill
Data skew happens when a handful of join keys contain most of the data.
Example:
"India" → 5 million records
"USA" → 200,000
"UK" → 10,000
This causes:
- long-running tasks
- straggler tasks
- memory overflow
- executor timeout
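Before picking a fix, confirm the skew by counting rows per join key; a quick check (DataFrame and key names are placeholders):
from pyspark.sql import functions as F
# If one key dwarfs the rest, its partition becomes the straggler task.
df1.groupBy("country").count().orderBy(F.desc("count")).show(10)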
⭐ Solution 1: Salting Keys
Spread the hot keys across several partitions: add a random "salt" to the large, skewed side, and replicate the smaller side once per salt value so every salted key still finds its match (salting one side with a constant would drop most matches).
from pyspark.sql import functions as F
df1 = df1.withColumn("salt", (F.rand() * 10).cast("int"))
df2 = df2.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(10)])))
Now join on ["id", "salt"]; the hottest key is split across 10 tasks instead of one.
⭐ Solution 2: Broadcast the smaller table
⭐ Solution 3: Skew hint / Adaptive Query Execution
On Databricks Runtime you can hint the skewed relation directly:
df1.hint("skew").join(df2, "id")
Open-source Spark 3.x handles skew automatically through AQE instead, as shown below.
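A minimal sketch of the AQE settings (standard Spark 3.x configuration keys, both enabled by default in recent releases; assumes an active SparkSession named spark):
# Let Adaptive Query Execution detect oversized shuffle partitions and split them at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")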
🌟 5. Join Strategy Selection (What Spark Uses Internally)
Broadcast Hash Join
- Chosen when one side is below the auto-broadcast threshold (or explicitly hinted)
- No shuffle of the large side → fastest
Sort-Merge Join
- Default strategy for joining two large datasets
- Shuffles and sorts both sides → expensive
Shuffled Hash Join
- Builds an in-memory hash table per partition, skipping the sort
- Needs enough executor memory; Spark picks it only when it estimates that is safe (or when hinted)
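You can nudge the planner with join hints and confirm its choice in the physical plan. A minimal sketch (df1, df2 and the key "id" are placeholders; "broadcast", "merge" and "shuffle_hash" are the standard Spark 3.x hint names):
# Ask Spark to hash-join on df2's side instead of sort-merging, then verify the plan.
joined = df1.join(df2.hint("shuffle_hash"), "id")
joined.explain()   # look for ShuffledHashJoin vs SortMergeJoin in the physical plan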
🌟 6. Real-World Example: Retail Sales ETL
You have:
sales table → 200M records
product table → 50k records
The correct join:
df = sales.join(broadcast(products), "product_id")
This alone can reduce runtime by 10 to 20x, because the 200M-record sales table never has to be shuffled across the network.
🚀 Summary
Today you learned how to:
- Identify shuffle-heavy joins
- Remove unnecessary shuffles
- Use broadcast joins
- Fix skew
- Repartition strategically
- Apply join hints
Follow for more such content, and let me know in the comments if I missed anything. Thank you!!