
Azure Blob Storage with PySpark

Azure Blob Storage is a service for storing large amounts of unstructured data in any format, including binary data. It is a good service to build a data warehouse or data lake around, storing preprocessed or raw data for future analytics.

In this post, I'll explain how to access Azure Blob Storage using the Spark framework from Python.


Azure Blob Storage requires installing additional libraries to access its data, because it uses the wasb/wasbs protocol rather than the de facto standard hdfs protocol. The wasbs protocol is just an extension built on top of the HDFS APIs. To access resources in Azure Blob Storage, you need to pass the jar files hadoop-azure.jar and azure-storage.jar to spark-submit when you submit a job.

$ spark-submit --py-files src.zip \
            --master yarn \
            --deploy-mode=cluster \
            --jars hadoop-azure.jar,azure-storage.jar \
            src/app.py

Also, if you're using Docker or deploying the application to a cluster, there is a tip for you: you don't need to pass the additional jars at all if they already sit in a place where pyspark can find them (this has to be done on every cluster node, of course). Use the commands below with caution; the versions and links may be out of date!

$ wget -nc -nv -O /usr/local/lib/python3.5/site-packages/pyspark/jars/azure-storage-2.2.0.jar https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage/2.2.0/azure-storage-2.2.0.jar
$ wget -nc -nv -O /usr/local/lib/python3.5/site-packages/pyspark/jars/hadoop-azure-2.7.3.jar https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.3/hadoop-azure-2.7.3.jar
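If you're not sure where pyspark's jars directory lives on a given node or image, you can ask pyspark itself. A minimal sketch:

import os
import pyspark

# The "jars" folder next to the pyspark package is where
# a pip-installed Spark picks up bundled jars
print(os.path.join(os.path.dirname(pyspark.__file__), "jars"))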

At the application level, the first thing to do, as always in Spark applications, is to grab a SparkSession. The SparkSession is the entry point to the cluster's resources: it lets you read data, execute SQL queries over it, and get the results back.

from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()
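Alternatively, if the cluster nodes have internet access, you can have Spark resolve the jars from Maven at startup via the spark.jars.packages option instead of shipping them yourself. A sketch, using the same illustrative versions as above (this must be set before the session is first created):

from pyspark.sql import SparkSession

# Maven coordinates are resolved and downloaded at startup;
# match the hadoop-azure version to your Hadoop distribution
session = (
    SparkSession.builder
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-azure:2.7.3,"
        "com.microsoft.azure:azure-storage:2.2.0",
    )
    .getOrCreate()
)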

Then set the access key for your storage account:

session.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>"
)
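Hardcoding credentials is best avoided; a common pattern is to read them from environment variables instead. A minimal sketch, assuming hypothetical AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY variables:

import os

# Hypothetical variable names; use whatever your deployment provides
account = os.environ["AZURE_STORAGE_ACCOUNT"]
key = os.environ["AZURE_STORAGE_KEY"]

session.conf.set(
    "fs.azure.account.key.{}.blob.core.windows.net".format(account),
    key,
)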

Alternatively, use a SAS token scoped to the container:

session.conf.set(
    "fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net",
    "<sas-token>"
)

Once the account access key or SAS token is set up, you're ready to read from and write to Azure Blob Storage:

sdf = session.read.parquet(
    "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<prefix>"
)
sdf.show()
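Writing works the same way, except it goes through the DataFrame's write attribute rather than the session object (there is no session.write in Spark). A quick sketch with a hypothetical output path:

# Writing goes through the DataFrame, not the SparkSession
sdf.write.mode("overwrite").parquet(
    "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<output-prefix>"
)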

Thank you for reading!

Any questions? Leave your comment below to start fantastic discussions!

Check out my blog or come to say hi 👋 on Twitter or subscribe to my telegram channel.
Plan your best!


