As a Data Scientist, one often works with large amounts of data. I never had the chance to work on large data earlier. Recently I came across 1.3 GB of sensor data, and it was a little hard to work with using a pandas DataFrame. I had to wait a couple of minutes to read or write data or to perform any data manipulation.
I also realized that while working with big data, we can't rely on a pandas DataFrame. It performs poorly when reading and writing files (IO operations), and even data manipulation takes time. Reading a 1 GB CSV file took around 44 seconds using pandas, while PySpark took just 6 seconds (the time taken depends on hardware). It made me realize that I need to explore PySpark.
In this tutorial, we will walk through the PySpark installation steps and do some basic operations with the DataFrame object.
Step 1. PySpark Installation
You will need Java installed in the environment, with a proper JAVA_HOME variable defined. Make sure you install a JDK or JRE.
To install PySpark, we just need to do a pip installation in conda or any other Python virtual environment:
pip install pyspark
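If you want a quick sanity check that the installation worked, importing the package and printing its version is enough (an optional check, not part of the original setup):
# optional sanity check: import pyspark and print the installed version
python -c "import pyspark; print(pyspark.__version__)"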
Step 2. Session Initialization
Before doing any operation in PySpark, we need to initialize a Spark session. It can be done like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Practice').getOrCreate()
Inside appName, we can provide any name based on the objective. The session builder takes a little time to set up, but it is a one-time process. Once it's completed, PySpark is ready to use.
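The session object itself can confirm the setup worked. These are standard SparkSession attributes, shown here just as a quick check:
# the SparkSession exposes basic information about the running session
print(spark.version)                 # the Spark version in use
print(spark.sparkContext.appName)    # 'Practice', the name we set above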
Step 3. Reading a File
PySpark syntax is very similar to pandas. In the pandas library, we read a CSV file like this:
import pandas as pd
df_pandas = pd.read_csv('sample.csv')
df_pandas
Similarly, in Spark we have the syntax below:
df_pyspark = spark.read.csv("sample.csv")
df_pyspark.show()
Note: In PySpark, the DataFrame will not be displayed directly; we need to call show() on the DataFrame object.
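By default, spark.read.csv treats the first row as data and reads every column as a string. If sample.csv has a header row, the reader's header and inferSchema options handle that (these are standard options of Spark's CSV reader):
# read the first row as column names and let Spark infer the column types
df_pyspark = spark.read.csv("sample.csv", header=True, inferSchema=True)
df_pyspark.printSchema()   # prints the inferred column names and types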
Step 4. Some similar functions
There are some functions that are similar between pandas and PySpark DataFrames, like:
# head
df_pandas.head()               # pandas: first 5 rows as a DataFrame
df_pyspark.head(5)             # PySpark: first 5 rows as a list of Row objects
# describe
df_pandas.describe()
df_pyspark.describe().show()   # call show() to actually display the summary
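Another overlap worth knowing: both libraries expose the column names under the same attribute,
# columns: both return the column names as a list-like object
df_pandas.columns
df_pyspark.columns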
and many more that give almost similar syntax and results.
Step 5. Dissimilar functions
There are also a few functions which work differently from pandas, like the column selection and slicing functions. For example:
# column selection function
df_pandas['column1']                   # pandas: bracket indexing returns a Series
df_pyspark.select('column1').show()    # PySpark: select() returns a new DataFrame
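Slicing shows the same kind of difference. pandas supports positional row slicing, but a PySpark DataFrame has no positional index, so the closest equivalent is limit() (a small sketch using the same sample data as above):
# row slicing
df_pandas[0:3]                 # pandas: first 3 rows by position
df_pyspark.limit(3).show()     # PySpark: limit() takes the first 3 rows instead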
With this note, the PySpark learning journey begins...