This is my second post about Spark on Kubernetes. Here I share my experience with reducing the cost of Spark computation in the cloud, which can be expensive but can be cut by 60-70%. I am using Spark version 3.3.1.
1. If you run your research in client mode from an IPython notebook, use dynamic allocation. With this configuration, executor pods are created only while there is work to compute and are released once they become idle.
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.shuffleTracking.enabled true
spark.dynamicAllocation.shuffleTracking.timeout 120
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.maxExecutors 10
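In a notebook these settings are usually passed when the session is built. The sketch below shows one possible way to do that; `DYNAMIC_ALLOCATION_CONF` and `apply_conf` are hypothetical names introduced for illustration, and the values are simply the ones listed above. It deliberately avoids importing pyspark so that it runs anywhere; `builder` is assumed to be any object with a `config(key, value)` method, such as `SparkSession.builder`.

```python
from typing import Dict

# Values copied from the settings listed above.
DYNAMIC_ALLOCATION_CONF: Dict[str, str] = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.timeout": "120",
    "spark.dynamicAllocation.minExecutors": "0",
    "spark.dynamicAllocation.maxExecutors": "10",
}


def apply_conf(builder, conf: Dict[str, str]):
    """Call builder.config(key, value) for every setting and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With pyspark installed, this could be used as `spark = apply_conf(SparkSession.builder, DYNAMIC_ALLOCATION_CONF).getOrCreate()`.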
2. Using spot nodes for executors significantly reduces costs (spot nodes are 60-90% cheaper than on-demand nodes). Create a spot node group and label it, for example, spark: spot. The driver, however, should still run on on-demand nodes.
If you are running in client mode, set the following configuration:
spark.kubernetes.executor.node.selector.spark spot # the node label as key and value; in my case k=spark, v=spot
If you are using Spark Operator, use the following configuration settings:
spec:
  driver:
    nodeSelector:
      key1: value1
      key2: value2
  executor:
    nodeSelector:
      key1: value1
      key2: value2
P.S. Use the volume mount from the next point to keep executors' temporary results in case of a spot node interruption.
3. Mount an SSD volume on the executors. As mentioned above, this keeps executor temporary results in case of a spot node interruption. An SSD volume also speeds up writing and reading the temporary files that Spark saves on disk. You can use the following configuration settings:
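The node selector settings from this point can also be generated from a dictionary of labels. This is a minimal sketch, not the author's code: the function names are hypothetical, the driver label `spark: on-demand` is an assumed example, and the Spark Operator part assumes that `nodeSelector` is a plain mapping from label key to label value.

```python
import json
from typing import Dict


def executor_node_selector_conf(labels: Dict[str, str]) -> Dict[str, str]:
    """Client mode: one spark.kubernetes.executor.node.selector.<key> entry per label."""
    prefix = "spark.kubernetes.executor.node.selector."
    return {prefix + key: value for key, value in labels.items()}


def operator_node_selectors(driver_labels: Dict[str, str],
                            executor_labels: Dict[str, str]) -> dict:
    """Spark Operator: the nodeSelector part of a SparkApplication spec, as a dict."""
    return {
        "spec": {
            "driver": {"nodeSelector": dict(driver_labels)},
            "executor": {"nodeSelector": dict(executor_labels)},
        }
    }


if __name__ == "__main__":
    print(executor_node_selector_conf({"spark": "spot"}))
    # "spark: on-demand" is a hypothetical label for the driver node group.
    spec = operator_node_selectors({"spark": "on-demand"}, {"spark": "spot"})
    print(json.dumps(spec, indent=2))
```

Keeping the driver and executor labels as separate arguments makes it harder to send the driver to a spot node by accident.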
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass gp # your cloud ssd storage class
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit 100Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path /data
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly false
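All five keys share the same prefix and differ only in the volume name, so they can be built by a small helper. The sketch below is an illustration under assumptions: `executor_pvc_conf` is a hypothetical name, and its defaults are just the values shown above (volume name `data`, storage class `gp`, size `100Gi`, mount path `/data`).

```python
from typing import Dict


def executor_pvc_conf(volume_name: str = "data",
                      mount_path: str = "/data",
                      storage_class: str = "gp",
                      size_limit: str = "100Gi") -> Dict[str, str]:
    """Build the executor persistentVolumeClaim settings for one volume."""
    prefix = (
        "spark.kubernetes.executor.volumes.persistentVolumeClaim."
        f"{volume_name}."
    )
    return {
        # "OnDemand" asks Spark to create a claim per executor.
        prefix + "options.claimName": "OnDemand",
        prefix + "options.storageClass": storage_class,
        prefix + "options.sizeLimit": size_limit,
        prefix + "mount.path": mount_path,
        prefix + "mount.readOnly": "false",
    }
```

The Spark on Kubernetes documentation describes volumes whose name starts with `spark-local-dir-` as the ones used for local scratch space, so `executor_pvc_conf(volume_name="spark-local-dir-1")` may be worth trying if the temp files do not end up on the mounted volume.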
4. These are the values recommended in "Learning Spark" for shuffle I/O settings:
spark.shuffle.file.buffer 1m
spark.file.transferTo false
spark.shuffle.unsafe.file.output.buffer 1m
spark.io.compression.lz4.blockSize 512k
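When a job is submitted from the command line rather than a notebook, the same settings can be passed as `--conf key=value` flags to spark-submit. A minimal sketch, where `SHUFFLE_IO_CONF` and `to_spark_submit_args` are hypothetical names and the values are the ones listed above:

```python
from typing import Dict, List

# Values copied from the settings listed above.
SHUFFLE_IO_CONF: Dict[str, str] = {
    "spark.shuffle.file.buffer": "1m",
    "spark.file.transferTo": "false",
    "spark.shuffle.unsafe.file.output.buffer": "1m",
    "spark.io.compression.lz4.blockSize": "512k",
}


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render each setting as a ["--conf", "key=value"] pair for spark-submit."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_spark_submit_args(SHUFFLE_IO_CONF)))
```

Returning a list instead of one string lets the result be handed to `subprocess.run` without shell quoting issues.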
In conclusion, the steps above can significantly reduce the cost of running Spark computations in the cloud. Dynamic allocation, spot nodes for executors (60-90% cheaper than on-demand nodes), and SSD volume mounts account for most of the savings, and the values recommended in "Learning Spark" can help optimize performance. Test any configuration thoroughly before rolling it out, so that your users keep a smooth experience while you stay cost-effective.
Resources:
https://spot.io/blog/how-to-run-spark-on-kubernetes-reliably-on-spot-instances/
https://aws.amazon.com/blogs/compute/running-cost-optimized-spark-workloads-on-kubernetes-using-ec2-spot-instances/
https://spark.apache.org/docs/latest/running-on-kubernetes.html
https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/
P.S. My first post about Spark on k8s:
How to run Spark on kubernetes in jupyterhub
https://dev.to/akoshel/spark-on-k8s-in-jupyterhub-1da2