Cloud Dataproc Best Practices

#database #datascience #gcp #googlecloud

Cloud Dataproc is a managed Spark and Hadoop service used for batch processing, querying, streaming, and machine learning, and it takes advantage of open-source data tools. Its automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. Because less time and money go to administration, developers can focus on their jobs and data.

Here are some of the best practices for Cloud Dataproc:

Specify the cluster image version: Cloud Dataproc uses image versions to bundle the operating system, big data components (core and optional), and GCP connectors into a single package that is deployed on a cluster. If you don't specify an image version when creating a cluster, Cloud Dataproc defaults to the latest stable image version. For production environments, always tie your cluster creation step to a specific minor Cloud Dataproc version, so that you know exactly which OSS software versions your production jobs run against. Cloud Dataproc also lets you specify a subminor version (that is, 1.4.xx instead of 1.4), but in most environments it's preferable to reference only the minor version (for example, in the gcloud command that creates the cluster). Subminor versions are updated periodically with patches and fixes, so new clusters automatically pick up security updates without breaking compatibility.

New minor versions of Cloud Dataproc are made available in a preview, non-default mode before they become the default. This lets you test and validate production jobs against a new version of Cloud Dataproc before making the switch.
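To make the minor-version advice concrete, here is a minimal Python sketch that builds a `gcloud dataproc clusters create` command and refuses a subminor version. The helper name, the cluster name, the region, and the version-matching pattern are assumptions made for illustration; only the `--region` and `--image-version` flags come from the gcloud tool itself.

```python
import re
import shlex

# Assumption: a minor track looks like "1.4", optionally with an OS suffix
# such as "1.4-debian9"; a third numeric part (e.g. "1.4.21") is a subminor.
MINOR_TRACK = re.compile(r"^\d+\.\d+(-[a-z0-9]+)?$")


def create_cluster_command(cluster_name, region, image_version):
    """Build a cluster-creation command pinned to a minor image track."""
    if not MINOR_TRACK.match(image_version):
        raise ValueError(
            f"{image_version!r} is not a minor track; use e.g. '1.4', not '1.4.21'"
        )
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--image-version={image_version}",
    ]


if __name__ == "__main__":
    # Hypothetical cluster name and region, for illustration only.
    cmd = create_cluster_command("prod-etl", "us-central1", "1.4")
    print(shlex.join(cmd))
```

Keeping the check in the deployment script means a pinned subminor cannot slip into production unnoticed, while new clusters on the minor track still pick up patches.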

Understand when to use custom images: If you have dependencies that must ship with the cluster, such as native Python libraries that must be installed on all nodes, or specific security hardening or virus protection software, create a custom image from the latest image in your target minor track. This ensures those dependencies are met every time. Update the subminor version within your track each time you rebuild the image.
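Once a custom image exists, a cluster can be created from it instead of from a stock image version. The sketch below only assembles the command; the project ID, image name, cluster name, and helper names are hypothetical, and it assumes the image is referenced by its `projects/<project>/global/images/<name>` path through the `--image` flag.

```python
import shlex


def custom_image_uri(project_id, image_name):
    """Return the resource path of a custom image in a project."""
    return f"projects/{project_id}/global/images/{image_name}"


def create_cluster_from_image(cluster_name, region, project_id, image_name):
    """Build a cluster-creation command that uses a custom image."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        # --image replaces --image-version when a custom image is used.
        f"--image={custom_image_uri(project_id, image_name)}",
    ]


if __name__ == "__main__":
    # Hypothetical names; the image name records the minor track it was built on.
    cmd = create_cluster_from_image(
        "prod-etl", "us-central1", "my-project", "dataproc-1-4-hardened"
    )
    print(shlex.join(cmd))
```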

Use the Jobs API for submissions: The Cloud Dataproc Jobs API makes it easy to submit a job to an existing Cloud Dataproc cluster. The job is submitted as an HTTPS call, made through the gcloud command-line tool or the GCP Console. It also makes it simple to separate the permission to submit jobs to a cluster from the permission to access the cluster itself, without setting up gateway nodes or having to use something like Apache Livy.
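As an illustration of submitting through the Jobs API rather than logging in to a node, this Python sketch assembles a `gcloud dataproc jobs submit pyspark` command. The helper name, cluster name, bucket path, and job arguments are made up for the example; the person running it needs permission to submit jobs, not access to the cluster machines.

```python
import shlex


def submit_pyspark_command(cluster_name, region, main_file, job_args=()):
    """Build a command that submits a PySpark job to an existing cluster."""
    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", main_file,
        f"--cluster={cluster_name}",
        f"--region={region}",
    ]
    if job_args:
        # Everything after "--" is passed to the job itself, not to gcloud.
        cmd.append("--")
        cmd.extend(job_args)
    return cmd


if __name__ == "__main__":
    # Hypothetical cluster, bucket, and arguments, for illustration only.
    cmd = submit_pyspark_command(
        "prod-etl", "us-central1", "gs://my-bucket/jobs/daily_report.py",
        ["--date", "2020-01-01"],
    )
    print(shlex.join(cmd))
```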

Hope this was helpful.
