
Pandas: Data Manipulation and Analysis

**Introduction to Pandas**

Pandas is a Python library used for analyzing, cleaning, exploring, and manipulating data. It lets us analyze big data sets and draw conclusions based on statistical theory. Pandas can clean messy data sets and make them readable and relevant, and relevant data is very important in data science.

Installation and Importing Pandas

First of all, open your Jupyter Notebook, go to your terminal (or a notebook cell), and run "pip install pandas". That's it, the Pandas library is installed. Now, to use it in your notebook you need to import it, so write "import pandas as pd". You are ready to use the Pandas library.
Finally, check that the installation was successful:

import pandas as pd
print(pd.__version__)

Getting Started with Pandas

Series

A Series can hold almost any type of Python object as its data points. A Series also takes several parameters (data, index, dtype, name, copy, fastpath). To create a Pandas Series we just need a list; initially it will generate a default numeric index, but we can also control the index by passing a list of labels through the index parameter.

In the examples below, we first use an integer list and the Series generates its index automatically; in the second and third Series we customize the index as we want. That's the beauty of Pandas.
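A minimal sketch of what those three Series look like (the values and labels are just illustrative):

import pandas as pd

# 1) default index generated automatically
s1 = pd.Series([10, 20, 30])

# 2) and 3) custom labels passed through the index parameter
s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s3 = pd.Series([10, 20, 30], index=["x", "y", "z"])

print(s1)
print(s2)
print(s3)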

We can also build a Series from a dictionary, and Pandas will use the dictionary keys as the index.
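For example (the dictionary here is made up for illustration):

d = {"day1": 420, "day2": 380, "day3": 390}
s = pd.Series(d)
print(s)   # indexed by day1, day2, day3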

Use the index parameter to specify only the keys you wish to include in the Series, leaving out the rest of the dictionary. Two Series can also be added together.
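A quick sketch of both ideas, reusing the hypothetical dictionary from above:

s1 = pd.Series(d, index=["day1", "day2"])      # keep only the keys we ask for
s2 = pd.Series({"day1": 10, "day2": 20})
print(s1 + s2)                                 # addition aligns on the index labels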

Data Frames
The DataFrame is the main tool of Pandas. It takes several parameters, such as data, index, columns, dtype, and copy. To start with DataFrames we need a small setup in our Jupyter notebook. As we don't have any specific data set, we will use random data to learn DataFrames.

import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)  # make the random numbers reproducible

# 5x4 random data with row labels A-E and column labels X, Y, Z, P
df = pd.DataFrame(randn(5, 4), ["A", "B", "C", "D", "E"], ["X", "Y", "Z", "P"])
df

output


       X           Y           Z          P
A   2.706850    0.628133    0.907969    0.503826
B   0.651118    -0.319318   -0.848077   0.605965
C   -2.018168   0.740122    0.528813    -0.589001
D   0.188695    -0.758872   -0.933237   0.955057
E   0.190794    1.978757    2.605967    0.683509

A DataFrame is really just a bunch of Series that share an index. We can access one of those Series with the following command:

df["X"]
output
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: X, dtype: float64

Selecting more than one column returns a DataFrame:

df[["X","Y"]]
        X          Y
A   2.706850   0.628133
B   0.651118  -0.319318
C  -2.018168   0.740122
D   0.188695  -0.758872
E   0.190794   1.978757

We can also add or drop columns. To add a column:

df["updated"]=df["X"]+df["Y"]
df["updated"]

A    3.334983
B    0.331800
C   -1.278046
D   -0.570177
E    2.169552

This adds the numeric columns X and Y element-wise and stores the result in a new column called updated. To drop a column:

df.drop("updated", axis=1, inplace=True)
df
        X          Y          Z          P
A   2.706850   0.628133   0.907969   0.503826
B   0.651118  -0.319318  -0.848077   0.605965
C  -2.018168   0.740122   0.528813  -0.589001
D   0.188695  -0.758872  -0.933237   0.955057
E   0.190794   1.978757   2.605967   0.683509

As we have set inplace to True, the updated column is dropped permanently.
We can also drop rows with the same command:
df.drop("D")
We just have to pass the row label and it will be dropped (rows are the default, axis=0; pass axis=1 to drop a column instead).
We can check the dimensions of a DataFrame with:
df.shape
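A small sketch putting those together (nothing is saved back here because inplace is not used):

df.drop("D")           # returns a copy without row D
df.drop("P", axis=1)   # returns a copy without column P
df.shape               # (5, 4): 5 rows, 4 columns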

Conditional selection in Data Frames

df>0
output
        X      Y      Z      P
A    True   True   True   True
B    True  False  False   True
C   False   True   True  False
D    True  False  False   True
E    True   True   True   True

We can use this boolean DataFrame as a mask; positions where the condition is False become NaN:

booldf = df > 0
df[booldf]
output
        X          Y          Z          P
A   2.706850   0.628133   0.907969   0.503826
B   0.651118        NaN        NaN   0.605965
C        NaN   0.740122   0.528813        NaN
D   0.188695        NaN        NaN   0.955057
E   0.190794   1.978757   2.605967   0.683509

We can also apply a condition to a single column:

df["X"]>0
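This returns a boolean Series. A more common pattern is to use it to filter rows, as in this minimal sketch:

df[df["X"] > 0]         # keep only the rows where column X is positive
df[df["X"] > 0]["Y"]    # then grab a single column from the filtered result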

We can also combine multiple conditions:

df[(df["X"]>0) | (df["Y"]>1)]
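Pandas uses & for "and" and | for "or", and each condition needs its own parentheses. For example:

df[(df["X"] > 0) & (df["Y"] > 1)]   # rows where both conditions hold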

(I am skipping index resetting here; it is a bit hard to explain in this blog.)

Multi-Level Index (MultiIndex)

Index levels:

outside = "G1 G1 G1 G2 G2 G2".split()   # outer level labels
inside = [1, 2, 3, 1, 2, 3]             # inner level labels
higher_index = list(zip(outside, inside))
higher_index = pd.MultiIndex.from_tuples(higher_index)

Now we can use that MultiIndex to build a DataFrame with a multi-level index:

df=pd.DataFrame(randn(6,2),higher_index,["A","B"])


To get sub-DataFrames out of a multi-level index:

df.loc["G1"]          # everything under the outer label G1
df.loc["G1"].loc[1]   # then drill down to the inner label 1

Missing Data

d = {"A": [1, 2, np.nan], "B": [1, np.nan, np.nan], "C": [1, 2, 3]}
df = pd.DataFrame(d)
output
     A    B  C
0  1.0  1.0  1
1  2.0  NaN  2
2  NaN  NaN  3

To drop any row that holds missing data:

df.dropna()        # drops every row that has one or more NaN values
df.dropna(axis=1)  # drops every column that has one or more NaN values

To fill any missing data:

df.fillna(value="FILL VALUE")

To fill missing values in a specific column:

df["A"].fillna(value=3)

To fill with the mean of that column:

df["A"].fillna(value=df["A"].mean())
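Note that fillna returns a new Series rather than modifying df in place, so assign the result back if you want to keep it:

df["A"] = df["A"].fillna(value=df["A"].mean())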

GROUP BY

Grouping data and the common aggregate functions:

data = {"Company": ["GOOG", "GOOG", "ghgh", "fdff"],
        "Person": ["Mridul", "Mohosin", "Sabid", "Raihan"],
        "Sales": [10, 12, 34, 45]}
df = pd.DataFrame(data)
df
bycomp = df.groupby("Company")

To find the average sales of each company:

bycomp.mean()   # in newer Pandas versions you may need bycomp.mean(numeric_only=True)

Total sales of each company:

bycomp.sum()

Standard deviation:

bycomp.std()

To see the sales of a specific company:

bycomp.sum().loc["GOOG"]
Some more one-liners that chain groupby with other methods:

df.groupby("Company").sum().loc["GOOG"]
df.groupby("Company").count()
df.groupby("Company").max()
df.groupby("Company").min()
df.groupby("Company").describe()
df.groupby("Company").describe().transpose()
df.groupby("Company").describe().transpose()["GOOG"]

Operations: unique values, counting, and filtering

dic = {
    "col1": [1, 2, 3, 4, 5],
    "col2": [45, 5, 77, 88, 5],
    "col3": ["abc", "def", "ghi", "jkl", "mno"],
}
df = pd.DataFrame(dic)
df.head()

To find the unique values in a specific column:

df["col2"].unique()

It returns an array of the unique values.

We can also check the length of that array with the len function:

len(df["col2"].unique())


nunique() is an alternative to the len approach:

df["col2"].nunique()


To count how many times each unique value appears:

df["col2"].value_counts()

Filtering data using a conditional statement:

df[df["col1"]>2]

Pandas can also apply our own custom functions with the apply method.
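A minimal sketch of apply with a hypothetical function of our own:

def times_two(x):
    return x * 2

df["col1"].apply(times_two)        # apply our own function to every value
df["col1"].apply(lambda x: x * 2)  # the same thing with a lambda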

Thank you
