
Michael Staszel

Originally published at mikestaszel.com

Spark to Google Cloud Storage

This is the last post in my series on how to connect Spark to various data sources.

Here is how to connect to Google Cloud Storage using Spark 3.x.

First, create a service account in GCP, then download the JSON key file. Save it somewhere secure (e.g. import it into Vault or Secrets Manager).

Grant that service account permission to read from and write to your bucket.
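
If you want to sanity-check the key and the bucket permissions before touching Spark, a minimal sketch using the google-cloud-storage Python client works (this library isn't part of the original setup, and BUCKET-NAME and the key path are placeholders):

from google.cloud import storage

# Authenticate with the downloaded JSON key file.
client = storage.Client.from_service_account_json("PATH-TO-JSON-KEYFILE.json")

# Write and read back a tiny object to confirm the service account's access.
bucket = client.bucket("BUCKET-NAME")
blob = bucket.blob("hello/_access_check.txt")
blob.upload_from_string("ok")
print(blob.download_as_text())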

Grab the Cloud Storage connector JAR for Hadoop 3.x (gcs-connector-hadoop3-latest.jar) from Google.

Place that JAR into Spark’s jars directory.
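
If you're running a pip-installed PySpark, one way to script that download is the sketch below. The URL is Google's hosted link for the connector, and the jars path assumes the pip package layout; adjust both if your setup differs:

import os
import urllib.request

import pyspark

# Google's hosted download for the Hadoop 3.x GCS connector.
CONNECTOR_URL = "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar"

# With pip-installed PySpark, the jars directory lives inside the package.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
urllib.request.urlretrieve(CONNECTOR_URL, os.path.join(jars_dir, "gcs-connector-hadoop3-latest.jar"))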

Finally, fire up a Spark shell (or PySpark as in this example), and run:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Point the GCS connector at the service account key file.
spark.conf.set("google.cloud.auth.service.account.enable", "true")
spark.conf.set("google.cloud.auth.service.account.json.keyfile", "PATH-TO-JSON-KEYFILE.json")

# Write a small DataFrame to the bucket to verify everything works.
df = spark.createDataFrame([{"num": x} for x in range(10000)])
df.write.mode("overwrite").parquet("gs://BUCKET-NAME/hello/")
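
To double-check the write, read the Parquet files back (same placeholder bucket as above):

# Read back what was just written; the count should match the 10,000 rows.
df2 = spark.read.parquet("gs://BUCKET-NAME/hello/")
print(df2.count())

If you'd rather set the credentials once at session creation, the same keys can be passed to SparkSession.builder.config() with the spark.hadoop. prefix instead of calling spark.conf.set() afterwards.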


That’s all there is to it.

This was a quick post, mainly for myself, in case I ever need to come back to it.
