I say it's not trivial because you have to fulfill at least a few conditions.
You have to read the data the same way it was bucketed: when writing out a bucketed table, Spark applies a hash function to the bucketing key to decide which bucket each row goes into.
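The bucket-routing step can be illustrated with a toy model. Spark actually uses a Murmur3-based hash of the bucketing key; the sketch below substitutes `zlib.crc32` purely to show the idea that a fixed hash taken modulo the bucket count sends all rows with the same key to the same bucket, so a reader using the same hash and bucket count never needs to shuffle.

```python
import zlib
from collections import defaultdict

def bucket_for(key: str, num_buckets: int) -> int:
    # Toy stand-in for Spark's Murmur3-based bucket assignment:
    # hash the key, then take it modulo the number of buckets.
    return zlib.crc32(key.encode()) % num_buckets

# Writing side: each row is routed to a bucket by its key's hash.
rows = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4)]
buckets = defaultdict(list)
for key, value in rows:
    buckets[bucket_for(key, num_buckets=4)].append((key, value))

# Reading side: with the same hash and bucket count, every row for
# a given key is known to sit in exactly one bucket, so a join on
# that key can skip the shuffle.
alice_bucket = bucket_for("alice", 4)
alice_rows = [r for r in buckets[alice_bucket] if r[0] == "alice"]
```

The key point is determinism: change the hash function or the bucket count on the read side and the co-location guarantee is gone.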
`spark.sql.shuffle.partitions` must match the number of buckets; otherwise we fall back to a standard shuffle.
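As a sketch, assuming the table was written with 8 buckets (an arbitrary number for illustration), the setting could live in `spark-defaults.conf`:

```
# assuming the table was bucketed into 8 buckets,
# align shuffle partitions before reading it back
spark.sql.shuffle.partitions  8
```

The same value can of course be set per session instead of cluster-wide.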
Choose the bucket columns wisely; everything depends on the workload. Sometimes it is better to hand the optimization over to Catalyst than to do it yourself.
Choose the number of buckets wisely; this is also a tradeoff. If you have as many executors as buckets, loading is fast. However, if the data volume per bucket is too small, the many small files may not be very good in terms of performance.