Partition Pruning

Working with Big Data often presents the challenge of slow query results due to the overhead of scanning massive datasets. Optimization involves more than just how you read or aggregate data; for high-performance scanning, data must be organized in a way that the Spark engine can consume efficiently.
This is where partition pruning becomes essential. If data is well-partitioned within storage systems like HDFS, S3, or ADLS, Spark queries will only scan the specific partition folders required, significantly reducing processing time.

As you can see in the code below, instead of writing one giant file, Spark creates a directory hierarchy.

df.write.partitionBy("Year", "Month").parquet("/data/consumer")

Physical Result:

  • /data/consumer/Year=2023/Month=01/
  • /data/consumer/Year=2023/Month=02/
  • /data/consumer/Year=2024/Month=01/
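
If you want to try this end to end, here is a minimal sketch. The sample rows, and the Product and Amount columns, are made up for illustration; you may also want to swap the path for a directory you can write to locally, such as /tmp/data/consumer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pruning-demo").getOrCreate()

# Toy data: (Year, Month, Product, Amount) -- illustrative values only.
rows = [
    (2023, 1, "laptop", 1200.0),
    (2023, 2, "phone", 800.0),
    (2024, 1, "tablet", 450.0),
]
df = spark.createDataFrame(rows, ["Year", "Month", "Product", "Amount"])

# One sub-folder is created per distinct (Year, Month) pair.
df.write.mode("overwrite").partitionBy("Year", "Month").parquet("/data/consumer")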

Partition Pruning happens automatically when you query the data.
If you run:
from pyspark.sql.functions import col

df = spark.read.parquet("/data/consumer").filter(col("Year") == 2024)

Spark looks at the query, looks at the folder structure, and says, "Okay, the user only wants 2024. I am going to completely ignore the 'Year=2023' folder. I won't even list the files inside it."
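
You don't have to take Spark's word for it: the physical plan shows which partition filters were pushed down. Here is a quick sanity check, reusing the same path as above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("pruning-demo").getOrCreate()

df = spark.read.parquet("/data/consumer").filter(col("Year") == 2024)

# The FileScan node in the plan should show something like
#   PartitionFilters: [isnotnull(Year), (Year = 2024)]
# meaning the Year=2023 folders are never even listed.
df.explain()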

This can turn a 10TB scan into a 100GB scan instantly.
