This is a continuation of my previous posts:
Spark on AWS Glue: Performance Tuning 2 (Glue DynamicFrame vs Spark DataFrame)
Spark on AWS Glue: Performance Tuning 3 (Impact of Partition Quantity)
Impact of Partition Quantity
We will compare execution speed across three partition counts: 1, the default (unspecified), and 300.
We will also compare operations with and without shuffling.
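The measurements below use the same timer context manager as the previous posts. For reference, here is a minimal sketch of an equivalent helper, assuming it does nothing more than print elapsed wall-clock time:

import time
from contextlib import contextmanager

@contextmanager
def timer(name):
    # Print how long the enclosed block took, matching the log format below
    start = time.time()
    yield
    print(f"[{name}] done in {time.time() - start:.4f} s")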
DataFrame Preparation
# 1 partition: collapse the input into a single partition
df = spark.read.format("parquet").load("s3://.../parquet-chunk-high/")
one_part_df = df.coalesce(1)
print(one_part_df.rdd.getNumPartitions())
one_part_df.count()

# Default: leave the partition count up to Spark's file-split planning
part_df = spark.read.format("parquet").load("s3://.../parquet-chunk-high/")
print(part_df.rdd.getNumPartitions())
part_df.count()

# 300 partitions: force a full shuffle into 300 partitions
df = spark.read.format("parquet").load("s3://.../parquet-chunk-high/")
part_300_df = df.repartition(300)
print(part_300_df.rdd.getNumPartitions())
part_300_df.count()
1
94
300
By default (with the partition count unspecified), the data was read as 94 partitions.
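This default comes from Spark's file-split planning, which packs the input files into splits of roughly spark.sql.files.maxPartitionBytes each. You can inspect the relevant settings directly (the values in the comments are stock Spark defaults, not something measured on this cluster):

# Target maximum size of one input split (default 128 MB)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))  # 134217728
# Estimated cost of opening a file, used when packing small files together
print(spark.conf.get("spark.sql.files.openCostInBytes"))    # 4194304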
Without shuffling
with timer('one part filter'):
    result = one_part_df.filter(one_part_df['request_processing_time'] < 0.0008).count()
    print(result)

with timer('part filter'):
    result = part_df.filter(part_df['request_processing_time'] < 0.0008).count()
    print(result)

with timer('part 300 filter'):
    result = part_300_df.filter(part_300_df['request_processing_time'] < 0.0008).count()
    print(result)
9
[one part filter] done in 45.5252 s
9
[part filter] done in 1.4579 s
9
[part 300 filter] done in 3.5410 s
The default 94 partitions is the fastest: a single partition runs as one task with no parallelism at all, while 300 partitions creates many tiny tasks whose scheduling overhead outweighs the extra parallelism.
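If you do set a partition count explicitly, a common starting point is a small multiple of the task slots the cluster actually offers. A sketch, assuming defaultParallelism reflects the total executor cores of the Glue job (the multiplier 2 is a rule of thumb, not something benchmarked in this post):

# Total task slots Spark believes it has across all executors
cores = spark.sparkContext.defaultParallelism

# Keep every core busy without creating many tiny tasks
tuned_df = df.repartition(cores * 2)
print(tuned_df.rdd.getNumPartitions())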
With shuffling
from pyspark.sql import functions as F

with timer('one part shuffle'):
    result = one_part_df.groupBy('elb_name').agg(F.mean('request_processing_time').alias("mean_time")).orderBy('mean_time').count()
    print(result)

with timer('part shuffle'):
    result = part_df.groupBy('elb_name').agg(F.mean('request_processing_time').alias("mean_time")).orderBy('mean_time').count()
    print(result)

with timer('part 300 shuffle'):
    result = part_300_df.groupBy('elb_name').agg(F.mean('request_processing_time').alias("mean_time")).orderBy('mean_time').count()
    print(result)
9
[one part shuffle] done in 78.1068 s
9
[part shuffle] done in 2.6624 s
9
[part 300 shuffle] done in 12.2829 s
Again, the default 94 partitions is the fastest. With a shuffle involved the gap widens further: the single-partition DataFrame must do all the map-side work in one task, and 300 partitions pays shuffle overhead across many small blocks.
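One caveat: the groupBy introduces an exchange whose output partition count is controlled by spark.sql.shuffle.partitions (200 by default), regardless of the input partitioning. If the aggregation itself were the bottleneck, aligning that setting with the input parallelism would be the next knob to try; a sketch (the value 94 below simply mirrors the default read and is an assumption, not a measured optimum):

# Post-shuffle partition count used by aggregations and joins (default 200)
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Align it with the input parallelism before running the aggregation
spark.conf.set("spark.sql.shuffle.partitions", "94")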
Summary
- In general, too few partitions hurts more than too many: with one partition, the cluster's parallelism goes entirely unused.
- Even so, a "just right" number of partitions is the most efficient; here the 94 partitions Spark chose by default beat both 1 and 300.