Tomoya Oda

Spark on AWS Glue: Performance Tuning 3 (Impact of Partition Quantity)

This is a continuation of my previous posts in this series:

  1. Spark on AWS Glue: Performance Tuning 1 (CSV vs Parquet)

  2. Spark on AWS Glue: Performance Tuning 2 (Glue DynamicFrame vs Spark DataFrame)

  3. Spark on AWS Glue: Performance Tuning 3 (Impact of Partition Quantity)

  4. Spark on AWS Glue: Performance Tuning 4 (Spark Join)

  5. Spark on AWS Glue: Performance Tuning 5 (Using Cache)

Impact of Partition Quantity

We will compare the speed of three different partition counts: 1, the default (unspecified), and 300. We will also compare the results with and without shuffling.
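The snippets below assume a SparkSession (`spark`), `pyspark.sql.functions` imported as `F`, and a `timer` context manager introduced in the earlier posts of this series. If you are following along from scratch, a minimal sketch of those helpers could look like this (the `timer` implementation here is my assumption, not necessarily identical to the original; in a Glue job the SparkSession is typically provided for you):

import time
from contextlib import contextmanager

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition-quantity").getOrCreate()

@contextmanager
def timer(name):
    # Wall-clock timer around a block of Spark actions
    start = time.time()
    yield
    print(f"[{name}] done in {time.time() - start:.4f} s")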

DataFrame Preparation

# Single partition: coalesce everything into one partition
df = spark.read.format("parquet").load("s3:/.../parquet-chunk-high/")
one_part_df = df.coalesce(1)
print(one_part_df.rdd.getNumPartitions())
one_part_df.count()

# Default: let Spark decide the partition count from the input files
part_df = spark.read.format("parquet").load("s3:/.../parquet-chunk-high/")
print(part_df.rdd.getNumPartitions())
part_df.count()

# 300 partitions: repartition triggers a full shuffle of the data
df = spark.read.format("parquet").load("s3:/.../parquet-chunk-high/")
part_300_df = df.repartition(300)
print(part_300_df.rdd.getNumPartitions())
part_300_df.count()
1
94
300

By default (unspecified), the data was read into 94 partitions.
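The default count comes from how Spark splits the input files, which is governed mainly by `spark.sql.files.maxPartitionBytes` (128 MB by default). If you want to check this on your own cluster, a quick look could be (this check is my addition, not part of the original measurement):

# Inspect the split size Spark uses when planning file-based partitions
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))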

Without shuffling

# filter + count needs no data movement between partitions (no shuffle)
with timer('one part filter'):
    result = one_part_df.filter(one_part_df['request_processing_time'] < 0.0008).count()
    print(result)

with timer('part filter'):
    result = part_df.filter(part_df['request_processing_time'] < 0.0008).count()
    print(result)

with timer('part 300 filter'):
    result = part_300_df.filter(part_300_df['request_processing_time'] < 0.0008).count()
    print(result)
9
[one part filter] done in 45.5252 s
9
[part filter] done in 1.4579 s
9
[part 300 filter] done in 3.5410 s

The default of 94 partitions is the fastest. A single partition is by far the slowest because the whole filter runs as a single task.
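You can confirm that this filter-and-count pattern involves no shuffle by inspecting the physical plan; no Exchange node should appear. This check is my addition, not part of the original post:

# The physical plan for the filter contains no Exchange (shuffle) node
part_df.filter(part_df['request_processing_time'] < 0.0008).explain()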

With shuffling

# groupBy/orderBy require a shuffle: rows are exchanged across partitions
with timer('one part shuffle'):
    result = one_part_df.groupBy('elb_name').agg(F.mean('request_processing_time').alias("mean_time")).orderBy('mean_time').count()
    print(result)

with timer('part shuffle'):
    result = part_df.groupBy('elb_name').agg(F.mean('request_processing_time').alias("mean_time")).orderBy('mean_time').count()
    print(result)

with timer('part 300 shuffle'):
    result = part_300_df.groupBy('elb_name').agg(F.mean('request_processing_time').alias("mean_time")).orderBy('mean_time').count()
    print(result)

9
[one part shuffle] done in 78.1068 s
9
[part shuffle] done in 2.6624 s
9
[part 300 shuffle] done in 12.2829 s

Again, the default of 94 partitions is the fastest. Oversplitting into 300 partitions adds task-scheduling and shuffle overhead, so it ends up slower than the default.
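In contrast to the filter case, the plan for this aggregation does contain Exchange nodes, which is where the shuffle happens (again, this check is my addition):

# The physical plan for the aggregation shows Exchange (shuffle) nodes
part_df.groupBy('elb_name').agg(F.mean('request_processing_time').alias("mean_time")).explain()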

Summary

  • In general, a larger number of partitions is better than a very small one, because more tasks can run in parallel.
  • More is not always better, though: a "just right" number of partitions (here, the default of 94) is the most efficient. One rough way to pick a target is sketched below.
