Hi, thank you for sharing.
Could you elaborate on the gotchas? And why do you find the use not trivial? Is it because we have to save the df using write method ect? So sometimes we will save the df "normally" and using buckets?
Thank you for your support, Maxime!
I say it's not trivial because you have to fulfill at least a few conditions.
You have to read the data the same way it was bucketed: when writing, Spark applies a hash function to the bucketing key to decide which bucket each row goes into.
spark.sql.shuffle.partitions must match the number of buckets; otherwise you still get a standard shuffle.
Choose the bucketing columns wisely; it all depends on the workload. Sometimes it is better to hand the optimization over to Catalyst than to do it yourself.
Choose the number of buckets wisely too; this is also a tradeoff. If you have as many executors as buckets, loading is fast. But if the data volume is too small, a large number of buckets may actually hurt performance.