<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: bhadresh savani</title>
    <description>The latest articles on DEV Community by bhadresh savani (@bhadreshpsavani).</description>
    <link>https://dev.to/bhadreshpsavani</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F348652%2Fcf80a467-2a90-4203-baa2-7891323b7a62.jpeg</url>
      <title>DEV Community: bhadresh savani</title>
      <link>https://dev.to/bhadreshpsavani</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bhadreshpsavani"/>
    <language>en</language>
    <item>
      <title>Tutorial1: Getting Started with Pyspark</title>
      <dc:creator>bhadresh savani</dc:creator>
      <pubDate>Mon, 31 Oct 2022 17:49:33 +0000</pubDate>
      <link>https://dev.to/bhadreshpsavani/getting-started-with-pyspark-7p8</link>
      <guid>https://dev.to/bhadreshpsavani/getting-started-with-pyspark-7p8</guid>
      <description>&lt;p&gt;As a Data Scientist one might have worked with large amount of data. I never got chance to work on large data earlier. Recently i came across a 1.3gb of sensor data, it was little hard to work on using pandas dataframe. I have to wait for couple of miniutes to read or write data or to perform data manipulation.&lt;/p&gt;

&lt;p&gt;I also realized that a pandas dataframe is a poor fit for big data. It performs worse at reading and writing files (I/O operations), and even data manipulation takes longer. &lt;strong&gt;Reading a 1 GB CSV file took around 44 seconds using pandas, while PySpark took just 6 seconds.&lt;/strong&gt; (The time taken depends on the hardware.) That made me realize I needed to explore PySpark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pLyCEdmb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f2fvdhm88kfpqqsyo9cf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pLyCEdmb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f2fvdhm88kfpqqsyo9cf.png" alt="pyspark Advantages" width="599" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this tutorial, we will walk through the PySpark installation steps and perform some basic operations on a dataframe object.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step1. PySpark Installation
&lt;/h2&gt;

&lt;p&gt;You will need Java installed in the environment, with a proper &lt;code&gt;JAVA_HOME&lt;/code&gt; variable defined. Make sure you install a JDK or JRE.&lt;/p&gt;

&lt;p&gt;To install PySpark, we just need a pip install inside &lt;code&gt;conda&lt;/code&gt; or any Python &lt;code&gt;virtual environment&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install pyspark&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step2. Session Initialization
&lt;/h2&gt;

&lt;p&gt;Before doing any operation in &lt;code&gt;pyspark&lt;/code&gt; we need to initialize a Spark session. It can be done like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Practice'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside &lt;code&gt;appName&lt;/code&gt;, we can provide any name based on the objective. The session builder takes a little time to set up, but it is a one-time process.&lt;/p&gt;

&lt;p&gt;Once it's complete, &lt;code&gt;pyspark&lt;/code&gt; is ready to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step3. Reading a File
&lt;/h2&gt;

&lt;p&gt;PySpark syntax is very similar to pandas. With the pandas library, we read a CSV file like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df_pandas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'sample.csv'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_pandas&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, in Spark we have the syntax below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_pyspark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sample.csv"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_pyspark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: in PySpark the dataframe is not displayed directly; we need to call &lt;code&gt;show()&lt;/code&gt; on the dataframe object.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step4. Some Similar Functions
&lt;/h2&gt;

&lt;p&gt;Some functions are similar between &lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;pyspark&lt;/code&gt; dataframes, such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# head
&lt;/span&gt;&lt;span class="n"&gt;df_pandas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df_pyspark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# describe
&lt;/span&gt;&lt;span class="n"&gt;df_pandas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df_pyspark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and many more that give almost identical syntax and results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step5. Dissimilar Functions
&lt;/h2&gt;

&lt;p&gt;There are also a few functions that work differently from pandas, such as column selection and slicing. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# column selection function
&lt;/span&gt;&lt;span class="n"&gt;df_pandas&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'column1'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_pyspark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'column1'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this note, the pyspark learning journey begins...&lt;/p&gt;

&lt;h3&gt;
  
  
  Reference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.javatpoint.com/pyspark"&gt;https://www.javatpoint.com/pyspark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://insaid.medium.com/eda-with-pyspark-1f29b7d1618"&gt;https://insaid.medium.com/eda-with-pyspark-1f29b7d1618&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>pyspark</category>
      <category>pandas</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
