Welcome to Day 4 of the Spark Mastery Series.
Yesterday we learned RDD basics. Today we go deeper into partitions, shuffles, coalesce, repartition, and persistence: the core concepts that define Spark performance.
1. Understanding Partitions
A partition is Spark's basic unit of parallel processing.
Think of partitions like:
- Slices of a pizza
- Each slice handled by one executor core
- More partitions = more parallel workers
Why do partitions matter?
- Too few partitions → cluster underutilized
- Too many partitions → scheduler overhead
Default number of partitions:
parallelize() → spark.default.parallelism (usually number of cores × 2)
File reads → based on block size
Check partitions:
rdd.getNumPartitions()
2. Narrow vs Wide Transformations (The Real Reason Your Jobs Are Slow)
Narrow transformations:
- No data movement
- No shuffle
- Faster
Examples: map, filter, union
Wide transformations:
- Data movement between executors
- Causes shuffle
- Creates new stage
Examples: reduceByKey, groupByKey, join, distinct
3. Shuffle: Spark's Most Expensive Operation
During shuffle, Spark:
- Writes data to disk
- Transfers it over network
- Reorganizes partitions
This is why shuffle-heavy jobs run slowly, and why reducing shuffle is one of the biggest levers in Spark performance tuning.
4. Repartition vs Coalesce
This is one of the most misunderstood concepts.
Repartition:
- Used to increase OR decrease partitions
- Causes full shuffle
- Data gets evenly distributed
- Good for large operations like joins
df2 = df.repartition(50)
When to use?
- Before joins
- Before large aggregations
- When dealing with skew
Coalesce:
- Used to reduce partitions only
- No shuffle
- Much faster than repartition
- Moves minimal data
df2 = df.coalesce(5)
When to use?
- Writing to small number of output files
- Improving file compactness
- When merging small partitions
5. Persistence & Caching: Boosting Performance
Spark recomputes the whole transformation chain on every action unless the result is cached.
Example:
processed = rdd.map(...).filter(...)
processed.persist()
processed.count()
processed.collect()
Without persist → Spark computes twice
With persist → second action reads from cache
Summary
Today we learned:
- How partitions work
- What causes shuffle
- Difference between narrow and wide transformations
- When to use repartition vs coalesce
- How caching helps performance
Follow for more such content, and let me know in the comments if I missed anything. Thank you!