Aziza Afrin


Powerful Pandas! Part-1

Today we're going to cover the pandas library.
Pandas is a Python library commonly used for data manipulation and data analysis, mostly in data science and machine learning. In this notebook, we're going to show how powerful the pandas library is!

Let's get started!

Let's import the NumPy and pandas libraries into our workspace. Here we are using a Kaggle notebook, where these libraries are already installed.

import numpy as np
import pandas as pd

If these libraries aren't installed in your environment, you have to install them before importing them.

1. Pandas Series

Let's create a series with pandas.

a1 = ['a', 'b', 'c']               # index labels
my_data = [50, 70, 30]             # values as a Python list
ar = np.array(my_data)             # the same values as a NumPy array
d = {'a': 50, 'b': 70, 'c': 30}    # the same data as a dict (keys become the index)
pd.Series(data=my_data, index=a1)

The same thing can be done with positional arguments:

pd.Series(my_data,a1)

and also with a dictionary, whose keys become the index labels:

pd.Series(d)
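As a quick check, the sketch below builds the same series in the three ways described above and confirms that they hold the same labels and values. The names `from_list`, `from_array`, and `from_dict` are made up for this illustration.

```python
import numpy as np
import pandas as pd

a1 = ['a', 'b', 'c']
my_data = [50, 70, 30]

# Three constructions of the same series
from_list = pd.Series(data=my_data, index=a1)        # keyword arguments
from_array = pd.Series(np.array(my_data), a1)        # positional arguments, NumPy array
from_dict = pd.Series({'a': 50, 'b': 70, 'c': 30})   # dict keys become the index

# Element-wise comparison; all() is True only if every pair of values matches
print((from_list == from_array).all())
print((from_list == from_dict).all())
```

Both comparisons should print `True`, since only the way the data is passed in differs.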

Indexing in series

series1=pd.Series([1,2,3,4],['A','B','C','D'])
series1
series1['C']
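A series can be indexed by label or by integer position. The following is a minimal sketch of both, using the same `series1`; the names `by_label`, `by_loc`, `by_position`, and `subset` are only for illustration.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3, 4], ['A', 'B', 'C', 'D'])

by_label = series1['C']          # label lookup with square brackets
by_loc = series1.loc['C']        # explicit label-based lookup
by_position = series1.iloc[2]    # position-based lookup (third element)
subset = series1[['A', 'D']]     # a list of labels returns a smaller series

print(by_label, by_loc, by_position)  # 3 3 3
print(subset.tolist())                # [1, 4]
```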

2. Pandas DataFrames

Import the libraries required for creating a data frame in Python with pandas.

import numpy as np
import pandas as pd
from numpy.random import randn

We set a fixed seed because we want to draw the same set of random numbers each time we run the code. Otherwise, our results would vary from run to run.

np.random.seed(1011)  # fix the seed so randn returns the same numbers on every run
# 5x4 frame of standard-normal values with row labels A-E and column labels W-Z
df = pd.DataFrame(randn(5, 4), index=['A', 'B', 'C', 'D', 'E'], columns=['W', 'X', 'Y', 'Z'])
df

Running this cell displays our data frame, with five rows (A to E) and four columns (W to Z).
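To see what the seed does, here is a small sketch that resets it to the same value twice and compares the two draws. The names `first` and `second` are chosen only for this example.

```python
import numpy as np
from numpy.random import randn

np.random.seed(1011)
first = randn(5, 4)    # first draw after seeding

np.random.seed(1011)
second = randn(5, 4)   # same seed again, so the same numbers come out

third = randn(5, 4)    # no reseeding: the generator has moved on

print(np.array_equal(first, second))  # True
print(np.array_equal(first, third))   # False
```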

If we want to grab the column 'W', the output is a Series:

df['W']

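Another way to grab a column is attribute access, similar to the dot notation in SQL. This works only when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.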

df.W

If we want to grab multiple columns, the output is a DataFrame:

df[['W','Z']]
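The difference between the return types can be checked directly. This is a minimal sketch that rebuilds the same data frame and prints the type of each selection; the variable names `one_col`, `two_cols`, and `one_col_frame` are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(1011)
df = pd.DataFrame(randn(5, 4), index=['A', 'B', 'C', 'D', 'E'],
                  columns=['W', 'X', 'Y', 'Z'])

one_col = df['W']           # single label -> Series
two_cols = df[['W', 'Z']]   # list of labels -> DataFrame
one_col_frame = df[['W']]   # one-element list still gives a DataFrame

print(type(one_col).__name__)        # Series
print(type(two_cols).__name__)       # DataFrame
print(type(one_col_frame).__name__)  # DataFrame
print(df.W.equals(df['W']))          # True: both spellings return the same column
```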

Add a column
Let's add a column 'H' to the data frame, computed as the sum of columns 'W' and 'Z':

df['H']=df['W']+df['Z']

Delete a column
To delete a column, we use the drop method with axis=1:

df.drop('H',axis=1)

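But if you display the data frame again, the new column is still there, because drop returns a new data frame by default and leaves the original unchanged. To modify df itself, we have to add another argument: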

df.drop('H',axis=1,inplace=True)

This permanently deletes the column from df.
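The sketch below walks through adding and dropping the column on the same kind of data frame, and shows that reassigning the result of `drop` is an alternative to `inplace=True`. The name `dropped` is made up for this example.

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(1011)
df = pd.DataFrame(randn(5, 4), index=['A', 'B', 'C', 'D', 'E'],
                  columns=['W', 'X', 'Y', 'Z'])

df['H'] = df['W'] + df['Z']      # add a column

dropped = df.drop('H', axis=1)   # returns a new frame; df is untouched
print('H' in df.columns)         # True
print('H' in dropped.columns)    # False

df = df.drop(columns='H')        # reassigning has the same effect as inplace=True
print('H' in df.columns)         # False
```

Reassignment keeps each step explicit, which some people find easier to follow than in-place modification.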

Selecting rows with a label-based index:

df.loc[['A','B'],['W','Y']]
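For comparison, here is a sketch that places label-based selection with `loc` next to position-based selection with `iloc` on the same data frame. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(1011)
df = pd.DataFrame(randn(5, 4), index=['A', 'B', 'C', 'D', 'E'],
                  columns=['W', 'X', 'Y', 'Z'])

by_label = df.loc[['A', 'B'], ['W', 'Y']]    # rows and columns by label
by_position = df.iloc[[0, 1], [0, 2]]        # the same cells by integer position
print(by_label.equals(by_position))          # True

row_a = df.loc['A']                          # a single row comes back as a Series
print(type(row_a).__name__)                  # Series

# Label slices include the end label; position slices exclude the end position
print(df.loc['A':'B'].equals(df.iloc[0:2]))  # True
```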

Conditional selection
Select the rows where the value in column W is greater than zero, and return only columns Y and X.

df[df['W']>0][['Y','X']]

Multiple conditions: can you explain what the following code returns?

df[(df['W']>0) & (df['Y']>1)]
df[(df['W']>0) | (df['Y']>1)]
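One way to work through the question is to look at the boolean masks themselves. The sketch below uses a small hand-written data frame (the values are made up for illustration, not taken from the notebook) so the result of each condition is easy to follow.

```python
import pandas as pd

# Hypothetical values chosen so each row's outcome is easy to read
df = pd.DataFrame({'W': [1.0, -0.5, 2.0, -0.3],
                   'X': [0.1, 0.2, 0.3, 0.4],
                   'Y': [1.5, 2.0, 0.2, -1.0]},
                  index=['A', 'B', 'C', 'D'])

mask_w = df['W'] > 0   # True for A and C
mask_y = df['Y'] > 1   # True for A and B

both = df[mask_w & mask_y]     # rows where both conditions hold
either = df[mask_w | mask_y]   # rows where at least one condition holds

print(list(both.index))    # ['A']
print(list(either.index))  # ['A', 'B', 'C']

# Filtering rows and picking columns in one step with loc
print(list(df.loc[mask_w, ['Y', 'X']].index))  # ['A', 'C']
```

The parentheses around each comparison matter, because `&` and `|` bind more tightly than `>` in Python.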

Multi-level index or index hierarchy
Now we will create a data frame whose index has more than one level.

outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']  # outer index level
inside = [1, 2, 3, 1, 2, 3]                      # inner index level
hi_index = list(zip(outside, inside))            # pairs such as ('G1', 1)
hi_index = pd.MultiIndex.from_tuples(hi_index)
df = pd.DataFrame(randn(6, 2), index=hi_index, columns=['A', 'B'])
df


To grab everything under G1

df.loc['G1']

Try to explain which value we grab with the following code:

df.loc['G2'].loc[2]['B']
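As a hint, the chained lookup moves from the outer level to the inner level and then to the column. The sketch below rebuilds a frame of the same shape (seeded here only so the example is reproducible) and compares the chained form with a single `loc` call that takes a tuple; the names `chained`, `direct`, and `inner_two` are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(1011)
outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]
hi_index = pd.MultiIndex.from_tuples(list(zip(outside, inside)))
df = pd.DataFrame(randn(6, 2), index=hi_index, columns=['A', 'B'])

chained = df.loc['G2'].loc[2]['B']   # outer level, then inner level, then column
direct = df.loc[('G2', 2), 'B']      # the same cell addressed in one step
print(chained == direct)             # True

# Cross-section: every row whose inner level equals 2
inner_two = df.xs(2, level=1)
print(list(inner_two.index))         # ['G1', 'G2']
```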

3. Read CSV file

CSV files contain plain text and are a well-known format that can be read by almost any tool, including pandas.

df = pd.read_csv('/kaggle/input/pandas/data_set.csv')

print(df.to_string()) 
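If you do not have access to the Kaggle file, the same function can be tried on an in-memory string. This is only a sketch: the column names and values below are invented for illustration and are not the contents of data_set.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content; the real data_set.csv may look different
csv_text = """Duration,Pulse,Calories
60,110,409.1
45,117,282.4
30,103,195.1
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the frame
print(df.shape)    # (3, 3)
print(df.dtypes)   # column types inferred from the text
```

For large files, `df.head()` is usually a more practical preview than `df.to_string()`, which renders every row.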

4. Correlations

The relationship between each pair of columns in your data set can be calculated with the corr() method. The correlations between the columns of our data are:

df.corr()

Correlation values range from -1 to 1. A negative value indicates a negative relationship: as the values of one variable increase, the values of the other decrease. A positive value indicates a positive relationship: as one variable increases, the other increases too. A value of 1 or -1 indicates a perfect linear relationship, and a value near 0 indicates little or no linear relationship.
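To make these values concrete, here is a minimal sketch with made-up columns: `up` rises in step with `x`, and `down` falls in step with it. The data and column names are invented for this illustration.

```python
import pandas as pd

# Hypothetical data: 'up' is 2 * x, 'down' is 12 - 2 * x
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'up': [2, 4, 6, 8, 10],
                   'down': [10, 8, 6, 4, 2]})

corr = df.corr()   # pairwise Pearson correlation by default
print(corr.round(2))

# Individual entries can be read from the matrix with loc
print(round(corr.loc['x', 'up'], 2))     # close to 1: perfect positive relationship
print(round(corr.loc['x', 'down'], 2))   # close to -1: perfect negative relationship
```

If a data frame also contains non-numeric columns, recent pandas versions may require selecting the numeric columns first or passing `numeric_only=True`.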

You can practice with more examples on your own. The notebook link is given below. Go to the link and practice.
Notebook Link: [https://www.kaggle.com/code/azizaafrin/powerful-pandas-part-1]

Happy Learning!❤️
