πŸ”₯ Day 4: RDD Internals - Partitions, Shuffles & Repartitioning Demystified

Welcome to Day 4 of the Spark Mastery Series.

Yesterday we learned RDD basics. Today we go deeper into partitions, shuffles, coalesce, repartition, and persistenceβ€”core concepts that define Spark performance.

⚑ 1. Understanding Partitions
A partition is Spark’s basic unit of parallel processing.

Think of partitions like:

  • Slices of a pizza
  • Each slice handled by one executor core
  • More partitions = more parallel workers

Why do partitions matter?

  • Too few partitions β†’ cluster underutilized
  • Too many partitions β†’ scheduler overhead

Default number of partitions:

  • parallelize() → spark.default.parallelism (typically the total number of executor cores, with a minimum of 2)
  • File reads → based on block/split size (roughly one partition per HDFS block)

Check partitions:

rdd.getNumPartitions()
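
A minimal sketch (assuming a live SparkContext named sc; the data and partition counts are illustrative):

# Explicit partition count at creation time
rdd = sc.parallelize(range(100), numSlices=8)
print(rdd.getNumPartitions())  # 8

# Without numSlices, Spark falls back to spark.default.parallelism
rdd_default = sc.parallelize(range(100))
print(rdd_default.getNumPartitions())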

πŸ” 2. Narrow vs Wide Transformations (The Real Reason Your Jobs Are Slow)

Narrow transformations:

  • No data movement
  • No shuffle
  • Faster

Examples: map, filter, union

Wide transformations:

  • Data movement between executors
  • Causes shuffle
  • Creates new stage

Examples: reduceByKey, groupByKey, join, distinct
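
A small sketch of the contrast (hypothetical word data; assumes a SparkContext named sc). Only the last step triggers a shuffle:

words = sc.parallelize(["spark", "rdd", "spark", "shuffle"])

# Narrow: each output partition depends on exactly one input partition
pairs = words.map(lambda w: (w, 1))  # no shuffle

# Wide: all values for a key must end up in the same partition
counts = pairs.reduceByKey(lambda a, b: a + b)  # shuffle → new stage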

πŸ”₯ 3. Shuffle β€” Spark’s Most Expensive Operation

During shuffle, Spark:

  • Writes data to disk
  • Transfers it over network
  • Reorganizes partitions

This is why shuffle-heavy jobs run slow, and why so much real-world Spark tuning effort goes into reducing shuffle.
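
You can spot shuffle boundaries in an RDD's lineage: each new indentation level in the debug string marks a new stage (a sketch, reusing the word-count pattern from above):

counts = (sc.parallelize(["a", "b", "a"])
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))

# Indentation changes in the output mark shuffle/stage boundaries
print(counts.toDebugString().decode("utf-8"))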

πŸ”„ 4. Repartition vs Coalesce
This is one of the most misunderstood concepts.

Repartition:

  • Used to increase OR decrease partitions
  • Causes full shuffle
  • Data gets evenly distributed
  • Good for large operations like joins
df2 = df.repartition(50)

When to use?

  • Before joins
  • Before large aggregations
  • When dealing with skew
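
For joins, you can also repartition both sides by the join key so matching rows land in the same partition (a sketch; the DataFrames, the customer_id column, and the partition count are hypothetical):

# Hash-partition both sides on the join key before joining
customers_part = customers.repartition(200, "customer_id")
orders_part = orders.repartition(200, "customer_id")

joined = customers_part.join(orders_part, on="customer_id")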

Coalesce:

  • Used to reduce partitions only
  • No shuffle
  • Much faster than repartition
  • Moves minimal data
df2 = df.coalesce(5)

When to use?

  • Writing to small number of output files
  • Improving file compactness
  • When merging small partitions
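
The typical pattern is coalescing right before a write, so you produce a handful of output files instead of hundreds of tiny ones (the output path is illustrative):

# 5 partitions in → 5 output files
df.coalesce(5).write.mode("overwrite").parquet("/tmp/output/daily_report")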

πŸ“¦ 5. Persistence & Caching-Boosting Performance

Spark recomputes an RDD's entire lineage for every action unless you cache it.

Example:

# Transformations are hypothetical placeholders for illustration
processed = rdd.map(lambda x: x * 2).filter(lambda x: x > 10)
processed.persist()
processed.count()    # first action: computes the pipeline and caches the result
processed.collect()  # second action: served from the cache

Without persist β†’ Spark computes twice
With persist β†’ second action reads from cache
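
If the cached data may not fit in memory, you can pick a storage level explicitly (a minimal sketch; MEMORY_AND_DISK spills partitions that don't fit in RAM to disk):

from pyspark import StorageLevel

# Keep in memory, spill to disk under memory pressure
processed.persist(StorageLevel.MEMORY_AND_DISK)

# Release the cache once you're done
processed.unpersist()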

🧠 Summary
Today we learned:

  • How partitions work
  • What causes shuffle
  • Difference between narrow and wide transformations
  • When to use repartition vs coalesce
  • How caching helps performance

Follow for more in this series, and let me know in the comments if I missed anything. Thank you!
