DEV Community

Pushpa Sree Potluri

Environment setup for Data Analysis with PySpark and Spark SQL

Data analysis is about extracting insights from your dataset, and getting to know the data is an important step before building a machine learning model. Spark is widely used for parallel data processing on computer clusters. It supports multiple programming languages (Python, Scala, R, and Java) and includes libraries for SQL (Spark SQL), machine learning (MLlib), stream processing (Spark Streaming), and graph analytics (GraphX). In this post, I set up a local environment for data analysis with PySpark and Spark SQL.

To run Spark locally, you need Java and Python 3 installed on your machine.
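
Before installing anything, it can help to confirm both prerequisites are visible from Python. The following is a minimal sketch, not part of the original setup: the helper name check_prerequisites is hypothetical, and it only inspects the current interpreter and looks for a java executable on your PATH.

```python
import shutil
import subprocess
import sys


def check_prerequisites():
    """Report whether Python 3 and a Java runtime are visible (hypothetical helper)."""
    report = {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        "python3": sys.version_info.major == 3,
        "java_path": shutil.which("java"),  # None when java is not on PATH
    }
    if report["java_path"]:
        try:
            # `java -version` usually writes its banner to stderr, not stdout.
            result = subprocess.run(
                [report["java_path"], "-version"],
                capture_output=True,
                text=True,
            )
            lines = (result.stderr or result.stdout).strip().splitlines()
            report["java_version"] = lines[0] if lines else "unknown"
        except (OSError, subprocess.SubprocessError):
            report["java_version"] = "unknown"
    return report


for key, value in check_prerequisites().items():
    print(f"{key}: {value}")
```

If java_path prints as None, install a Java runtime before continuing with the Spark steps.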

Install Spark
i. Go to https://spark.apache.org/downloads.html
ii. Select a Spark release version and a package type.
iii. Click the download link. It takes you to the Apache Software Foundation site, where you can start the download.
iv. Set the environment variables for the Spark home directory (the folder where you extracted the download) and for PySpark in a file called .bash_profile.
v. Install PySpark with pip, the Python package installer: pip install pyspark
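
The screenshots for steps iv and v are not reproduced here, so the following is only a sketch of what that setup commonly looks like. SPARK_HOME and PYSPARK_PYTHON are environment variables that Spark reads, but the exact lines from the original post are unknown: the helper names bash_profile_lines and setup_status, and the path /path/to/spark, are assumptions for illustration. Nothing is written to disk.

```python
import importlib.util
import os


def bash_profile_lines(spark_home):
    """Build export lines you could paste into .bash_profile (hypothetical helper)."""
    return [
        f'export SPARK_HOME="{spark_home}"',
        'export PATH="$SPARK_HOME/bin:$PATH"',
        "export PYSPARK_PYTHON=python3",
    ]


def setup_status():
    """Check what the current process can see after steps iv and v."""
    spark_home = os.environ.get("SPARK_HOME")
    return {
        "SPARK_HOME": spark_home,
        "spark_home_exists": bool(spark_home) and os.path.isdir(spark_home),
        # find_spec returns None when the pyspark package is not installed.
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
    }


# "/path/to/spark" is a placeholder; use the folder where you extracted Spark.
for line in bash_profile_lines("/path/to/spark"):
    print(line)
print(setup_status())
```

After editing .bash_profile, open a new terminal (or run source ~/.bash_profile) so the variables are visible, then rerun setup_status() to confirm.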
Launching Jupyter Notebook
i. Install Jupyter Notebook with pip: pip install jupyter
ii. Open a terminal window, navigate to your working directory, and type jupyter notebook. This launches Jupyter Notebook in your browser.
iii. Create a new notebook by clicking the "New" button on the upper right side and selecting Python 3.
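
Once the notebook is open, a quick smoke test can confirm that PySpark and Spark SQL work together. This is a minimal sketch under stated assumptions: the sample rows, the view name people, the app name, and the function name run_demo are all made up for illustration and do not come from the original post. It assumes pyspark is installed and Java is available.

```python
# Made-up sample data, used only to verify the setup.
SAMPLE_ROWS = [("Alice", 34), ("Bob", 45), ("Cara", 29)]
QUERY = "SELECT name, age FROM people WHERE age > 30 ORDER BY age"


def run_demo():
    """Start a local Spark session, run one Spark SQL query, and return the rows."""
    # Imported here so the cell defining this function loads even without pyspark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")          # run Spark locally on all available cores
        .appName("eda-setup-check")  # hypothetical app name
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(SAMPLE_ROWS, ["name", "age"])
        df.createOrReplaceTempView("people")  # expose the DataFrame to Spark SQL
        rows = spark.sql(QUERY).collect()
        return [(row["name"], row["age"]) for row in rows]
    finally:
        spark.stop()
```

In a notebook cell, calling run_demo() should return [("Alice", 34), ("Bob", 45)] if the environment is set up correctly. The session is stopped at the end because this is only a check; for real analysis you would keep the session open.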
