Tomoya Oda

Spark on AWS Glue: Performance Tuning 3 (Impact of Partition Quantity)

This is a continuation of my previous posts in this series:

  1. Spark on AWS Glue: Performance Tuning 1 (CSV vs Parquet)

  2. Spark on AWS Glue: Performance Tuning 2 (Glue DynamicFrame vs Spark DataFrame)

  3. Spark on AWS Glue: Performance Tuning 3 (Impact of Partition Quantity)

  4. Spark on AWS Glue: Performance Tuning 4 (Spark Join)

  5. Spark on AWS Glue: Performance Tuning 5 (Using Cache)

Impact of Partition Quantity

We will compare the speed of three different partition counts: 1, the default (unspecified), and 300. We will also compare the results with and without shuffling.
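The snippets below assume a SparkSession (`spark`), `pyspark.sql.functions` imported as `F`, and a `timer` context manager introduced in the earlier posts of this series. If you are following along from scratch, a minimal sketch of those helpers could look like this (the `timer` implementation here is my assumption, not necessarily identical to the original; in a Glue job the SparkSession is typically provided for you):

import time
from contextlib import contextmanager

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition-quantity").getOrCreate()

@contextmanager
def timer(name):
    # Wall-clock timer around a block of Spark actions
    start = time.time()
    yield
    print(f"[{name}] done in {time.time() - start:.4f} s")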

DataFrame Preparation

# Single partition: coalesce everything into one partition
df = spark.read.format("parquet").load("s3:/.../parquet-chunk-high/")
one_part_df = df.coalesce(1)
print(one_part_df.rdd.getNumPartitions())
one_part_df.count()

# Default: let Spark decide the partition count from the input files
part_df = spark.read.format("parquet").load("s3:/.../parquet-chunk-high/")
print(part_df.rdd.getNumPartitions())
part_df.count()

# 300 partitions: repartition triggers a full shuffle of the data
df = spark.read.format("parquet").load("s3:/.../parquet-chunk-high/")
part_300_df = df.repartition(300)
print(part_300_df.rdd.getNumPartitions())
part_300_df.count()
1
94
300

By default (unspecified), the data was read into 94 partitions.
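The default count comes from how Spark splits the input files, which is governed mainly by `spark.sql.files.maxPartitionBytes` (128 MB by default). If you want to check this on your own cluster, a quick look could be (this check is my addition, not part of the original measurement):

# Inspect the split size Spark uses when planning file-based partitions
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))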

Without shuffling

# filter + count needs no data movement between partitions (no shuffle)
with timer('one part filter'):
    result = one_part_df.filter(one_part_df['request_processing_time'] < 0.0008).count()
    print(result)

with timer('part filter'):
    result = part_df.filter(part_df['request_processing_time'] < 0.0008).count()
    print(result)

with timer('part 300 filter'):
    result = part_300_df.filter(part_300_df['request_processing_time'] < 0.0008).count()
    print(result)
9
[one part filter] done in 45.5252 s
9
[part filter] done in 1.4579 s
9
[part 300 filter] done in 3.5410 s

The default of 94 partitions is the fastest. A single partition is by far the slowest because the whole filter runs as a single task.
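You can confirm that this filter-and-count pattern involves no shuffle by inspecting the physical plan; no Exchange node should appear. This check is my addition, not part of the original post:

# The physical plan for the filter contains no Exchange (shuffle) node
part_df.filter(part_df['request_processing_time'] < 0.0008).explain()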

With shuffling

# groupBy/orderBy require a shuffle: rows are exchanged across partitions
with timer('one part shuffle'):
    result = one_part_df.groupBy('elb_name').agg(F.mean('request_processing_time').alias("mean_time")).orderBy('mean_time').count()
    print(result)

with timer('part shuffle'):
    result = part_df.groupBy('elb_name').agg(F.mean('request_processing_time').alias("mean_time")).orderBy('mean_time').count()
    print(result)

with timer('part 300 shuffle'):
    result = part_300_df.groupBy('elb_name').agg(F.mean('request_processing_time').alias("mean_time")).orderBy('mean_time').count()
    print(result)

9
[one part shuffle] done in 78.1068 s
9
[part shuffle] done in 2.6624 s
9
[part 300 shuffle] done in 12.2829 s

Again, the default of 94 partitions is the fastest. Oversplitting into 300 partitions adds task-scheduling and shuffle overhead, so it ends up slower than the default.
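In contrast to the filter case, the plan for this aggregation does contain Exchange nodes, which is where the shuffle happens (again, this check is my addition):

# The physical plan for the aggregation shows Exchange (shuffle) nodes
part_df.groupBy('elb_name').agg(F.mean('request_processing_time').alias("mean_time")).explain()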

Summary

  • In general, a larger number of partitions is better than a very small one, because more tasks can run in parallel.
  • More is not always better, though: a "just right" number of partitions (here, the default of 94) is the most efficient. One rough way to pick a target is sketched below.
