Welcome to Day 10 of the Spark Mastery Series!
Today's topic is one of the biggest performance boosters in Spark ETL pipelines.
Most Spark beginners learn transformations but never learn how data should be stored for maximum performance.
Partitioning & Bucketing are the two most powerful tools for that.
Let's master them.
1. Why Partitioning Matters in Spark
Partitioning is the process of splitting data into separate folders/files based on one or more columns.
Example:
df.write.partitionBy("year", "month").parquet("/sales")
This creates folders:
year=2024/month=01/
year=2024/month=02/
year=2024/month=03/
Benefits:
- Queries that filter on partition columns skip entire folders
- Less I/O
- Faster scans
- Lower compute cost
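To see the benefit in action, filter on the partition columns at read time. A minimal sketch, assuming the /sales path from the write above and an existing SparkSession named spark:

sales = spark.read.parquet("/sales")
jan = sales.filter((sales.year == 2024) & (sales.month == 1))  # prunes to year=2024/month=01/ only
jan.explain()  # the physical plan shows PartitionFilters instead of a full scan

Because year and month are folder names rather than row data, Spark never opens the files for other months.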
This partition-pruning technique is a cornerstone of Lakehouse architectures.
2. repartition() vs coalesce()
repartition()
Used to increase the number of partitions or rebalance data across them.
df = df.repartition("customer_id")
✔ Even distribution
✔ Useful before joins
✖ Slow (shuffle required)
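A minimal sketch of what the shuffle buys you, assuming df has a customer_id column:

print(df.rdd.getNumPartitions())         # e.g. a handful of uneven input partitions
df = df.repartition(200, "customer_id")  # hash-partitions rows by key into 200 partitions
print(df.rdd.getNumPartitions())         # 200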
coalesce()
Used to reduce the number of partitions.
df = df.coalesce(5)
✔ No shuffle
✔ Faster writes
✖ Cannot increase partitions
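coalesce() is most useful right before a write, after a job has left you with many small partitions. A minimal sketch (the output path is hypothetical):

# Merge the existing partitions down to 5 without shuffling,
# so the write produces 5 output files instead of hundreds of tiny ones
df.coalesce(5).write.mode("overwrite").parquet("/sales_compacted")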
3. When Should You Partition Your Data?
Partition when:
- You filter heavily on the same column
- You have time-based data
- You want faster analytics
Avoid partitioning when:
- Column has millions of unique values
- Files become extremely small (<1MB each)
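Time-based data usually needs its partition columns derived first. A minimal sketch, assuming the DataFrame has an order_ts timestamp column (column and path names are hypothetical):

from pyspark.sql.functions import year, month

# Derive low-cardinality partition columns from the timestamp,
# rather than partitioning by a high-cardinality column like order_id
df = df.withColumn("year", year("order_ts")).withColumn("month", month("order_ts"))
df.write.partitionBy("year", "month").parquet("/orders")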
4. What Is Bucketing and Why It's Powerful
Bucketing hash-distributes rows into a fixed number of buckets based on a column's value, which reduces shuffle in large-table joins.
df.write.bucketBy(20, "id").sortBy("id").saveAsTable("bucketed_users")
This hashes rows on id into 20 buckets, each sorted by id.
When you join two tables bucketed on the same key with the same number of buckets, Spark doesn't need to shuffle!
Benefits:
- Faster joins
- Deterministic data distribution
- Better for high-cardinality columns
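A minimal sketch of a shuffle-free bucketed join, assuming a second DataFrame orders_df (hypothetical) saved with the same bucket count and key as bucketed_users above:

orders_df.write.bucketBy(20, "id").sortBy("id").saveAsTable("bucketed_orders")

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  # rule out broadcast so the demo shows a sort-merge join
users = spark.table("bucketed_users")
orders = spark.table("bucketed_orders")
users.join(orders, "id").explain()  # no Exchange operator on either side of the join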
5. Partitioning vs Bucketing: Which One Should You Use?
Use Partitioning when:
✔ Queries heavily filter on the column
✔ Time-series queries
✔ Data skipping is needed
Use Bucketing when:
✔ You want to speed up joins on large datasets
✔ High-cardinality join keys
✔ Combine with partitioning for massive datasets
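The two techniques compose in a single write. A minimal sketch, assuming an events_df DataFrame with year, month, and user_id columns (all names are hypothetical):

# Partition for filter pruning, bucket for shuffle-free joins
(events_df.write
    .partitionBy("year", "month")
    .bucketBy(20, "user_id")
    .sortBy("user_id")
    .saveAsTable("events_bucketed"))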
6. Real-World Use Case (E-Commerce)
Sales data: partition by year, month, country
User table: bucket by user_id
When joining:
Bucketed tables → fast joins
Partitioned tables → fast filters
This is a common pattern in Databricks Lakehouse architectures.
Summary
We learned:
- What partitioning is
- What bucketing is
- repartition() vs coalesce()
- How Spark optimizes large joins
- How to choose partition keys
Follow for more content like this, and let me know in the comments if I missed anything. Thank you!