Narrow transformation
Operates within a single partition
- no data movement across the cluster
- no shuffle
- faster and cheaper
e.g. select, map, filter, withColumn, union
No shuffle is required because these are row-based transformations: each output partition is computed from a single input partition.
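A minimal sketch of the idea in plain Python, not the Spark API. Partitions are modeled as ordinary lists, and `narrow_map` and `narrow_filter` are hypothetical helper names used only for illustration. Each output partition is built from exactly one input partition, so nothing has to move between them.

```python
# Illustration only: partitions are modeled as plain Python lists.
# This is not Spark code; the helper names are made up for this sketch.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def narrow_map(parts, fn):
    # Apply fn to every row, partition by partition.
    return [[fn(x) for x in part] for part in parts]

def narrow_filter(parts, pred):
    # Keep matching rows; each partition is filtered on its own.
    return [[x for x in part if pred(x)] for part in parts]

doubled = narrow_map(partitions, lambda x: x * 2)
multiples_of_four = narrow_filter(doubled, lambda x: x % 4 == 0)
```

The number of partitions and the placement of rows stay the same from input to output, which is why these steps can run in one stage.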
Wide transformation
Requires data to be redistributed across partitions
- data shuffle
- creates a new stage
- expensive
e.g. join, groupBy, orderBy, distinct, reduceByKey
The result is computed from rows that may sit in different partitions, so a shuffle is required.
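A rough plain-Python sketch of why a key-based aggregation needs a shuffle; this is again not the Spark API. The helpers `partition_for`, `shuffle_by_key`, and `reduce_by_key` are hypothetical, and the byte-sum partitioner is an assumption chosen only to keep the example deterministic. Rows with the same key start in different partitions and must be brought together before they can be summed.

```python
from collections import defaultdict

# Illustration only: key/value rows spread over three input partitions.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]

def partition_for(key, num_partitions):
    # Toy deterministic partitioner (an assumption for this sketch).
    return sum(key.encode()) % num_partitions

def shuffle_by_key(parts, num_partitions):
    # Move every row to the partition that owns its key.
    out = [[] for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            out[partition_for(key, num_partitions)].append((key, value))
    return out

def reduce_by_key(parts, num_partitions):
    # After the shuffle, each key lives in one partition and can be summed locally.
    result = []
    for part in shuffle_by_key(parts, num_partitions):
        totals = defaultdict(int)
        for key, value in part:
            totals[key] += value
        result.append(dict(totals))
    return result

totals = reduce_by_key(partitions, 2)
```

The `shuffle_by_key` step reads from every input partition, and that cross-partition movement is the expensive part of a wide transformation.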