mary kariuki

Posted on Oct 18, 2022

Analyzing and cleaning data using pandas

Analyzing data using pandas

It is advisable to get a quick overview of your dataset as soon as you load it into a DataFrame. Datasets often come in CSV format, and we load one into a DataFrame as follows:

import pandas as pd
dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
print(dx)
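The path above only exists on the author's machine. As a minimal sketch that runs anywhere, the snippet below reads the same kind of data from an in-memory string instead. The sample rows and the "ASSIGN 1" column are made up for illustration; only "fname" and "ASSIGN 2" appear elsewhere in this post.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = """fname,ASSIGN 1,ASSIGN 2
Ann,78,85
Ben,64,92
Cal,,70
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column headers
```

The empty "ASSIGN 1" cell for Cal is read as a missing value (NaN), which matters for the cleaning steps later on.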

We can inspect our data in various ways, including:

head() function

It is a method that returns the column headers and the specified number of rows from the top (5 rows if no number is given).

import pandas as pd
dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\datas.csv")
print(dt.head(10))

tail() function

It returns the column headers and the specified number of rows from the bottom (5 rows if no number is given).

import pandas as pd
dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
print(dt.tail())
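To see how head() and tail() behave without needing a CSV file, here is a small sketch on a made-up 8-row DataFrame (the "id" and "score" columns are hypothetical).

```python
import pandas as pd

# Hypothetical 8-row frame used only for illustration.
df = pd.DataFrame({"id": range(1, 9),
                   "score": [55, 60, 65, 70, 75, 80, 85, 90]})

first_three = df.head(3)   # rows with id 1, 2, 3
last_two = df.tail(2)      # rows with id 7, 8
default_head = df.head()   # no argument: the first 5 rows

print(first_three)
print(last_two)
```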

info() function

It is a method that gives more information about your dataset: it shows the column datatypes, the number of non-null cells per column, and the memory usage, among others.

import pandas as pd
dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
# info() prints its report directly and returns None, so print() is not needed
dt.info()
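As a small sketch of what info() actually returns, the snippet below runs it on a made-up two-column frame and also captures the report as text through the buf parameter. The sample values are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical frame with one missing name.
df = pd.DataFrame({"fname": ["Ann", "Ben", None],
                   "ASSIGN 2": [85.0, 92.0, 70.0]})

# info() writes its report to standard output and returns None.
result = df.info()

# To keep the report as a string, pass a text buffer.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
```

The captured report can then be logged or saved next to the dataset.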

describe() function

This method gives a statistical description of the numeric columns: it shows the count, mean, standard deviation, minimum, maximum, and the quartiles (the 50% row is the median).

dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
print(dt.describe())
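The sketch below shows which rows describe() produces and where the median lives in the output. The four sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical marks for four students.
df = pd.DataFrame({"fname": ["Ann", "Ben", "Cal", "Dee"],
                   "ASSIGN 2": [60, 70, 80, 90]})

summary = df.describe()   # numeric columns only by default
print(summary)

# The "50%" row is the median of each numeric column.
print(summary.loc["50%", "ASSIGN 2"], df["ASSIGN 2"].median())

# include="all" also summarises non-numeric columns such as fname.
print(df.describe(include="all"))
```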

Cleaning data using pandas

Cleaning data simply means fixing or removing bad data in a dataset. This may involve removing empty cells, removing duplicates, and correcting data that is wrong or in the wrong format.

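removing duplicates

Duplicates are rows that have been registered more than once.
To find duplicates we use the duplicated() method, which returns a boolean value for every row: True if the row repeats an earlier row, otherwise False. Summing those values counts the duplicate rows, and drop_duplicates() removes them.

print(dt.duplicated().sum())  # number of duplicate rows
dt = dt.drop_duplicates()  # keep the first occurrence of each row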
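Here is a minimal sketch of finding and then dropping duplicate rows, using a made-up frame in which the third row repeats the first.

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0.
df = pd.DataFrame({"fname": ["Ann", "Ben", "Ann", "Cal"],
                   "ASSIGN 2": [85, 92, 85, 70]})

mask = df.duplicated()   # True for each row that repeats an earlier row
print(mask.sum())        # number of duplicate rows

# drop_duplicates() returns a new frame; df itself is left unchanged.
deduped = df.drop_duplicates()
print(deduped)
```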

removing empty cells

We remove rows that contain empty cells using the dropna() method. By default it returns a new DataFrame and does not change the original one.

dt = pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\datas.csv")
data1=dt.dropna()
print(data1)
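Before dropping anything it can help to count the empty cells per column. The sketch below does that on a made-up frame and then confirms that dropna() leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing name and one missing mark.
df = pd.DataFrame({"fname": ["Ann", None, "Cal"],
                   "ASSIGN 2": [85.0, 92.0, np.nan]})

missing_per_column = df.isna().sum()   # empty cells in each column
print(missing_per_column)

cleaned = df.dropna()   # drops every row with at least one empty cell
print(len(df), len(cleaned))
```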

cleaning wrong data

We fix wrong data in two ways: by removing the rows that contain the wrong values, or by replacing the wrong values.

1. removing wrong data

dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
dx.dropna(subset=['fname'],inplace=True)
print(dx)
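dropna() only removes rows where the chosen cell is empty. When "wrong" means a value outside the valid range, one option is to filter on a condition instead. This is a sketch under the assumption that marks above 90 are invalid, the same threshold used in this post; the sample rows are made up.

```python
import pandas as pd

# Hypothetical data: Ben's mark is out of range.
dx = pd.DataFrame({"fname": ["Ann", "Ben", "Cal"],
                   "ASSIGN 2": [85, 120, 70]})

# Keep only rows within the assumed valid range.
# Note: rows where the mark is missing (NaN) would also be filtered out.
valid = dx[dx["ASSIGN 2"] <= 90]
print(valid)
```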

2. replacing wrong data

dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
for line in dx.index:
    if dx.loc[line,'ASSIGN 2']>90:
        dx.loc[line,'ASSIGN 2']=10
print(dx)
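The loop above works, but the same replacement can usually be written in one line with a boolean mask, which tends to be faster on large datasets. A minimal sketch on made-up rows:

```python
import pandas as pd

# Hypothetical data: Ben's mark is above the threshold of 90.
dx = pd.DataFrame({"fname": ["Ann", "Ben", "Cal"],
                   "ASSIGN 2": [85, 95, 70]})

# Select every row where the mark exceeds 90 and overwrite it with 10.
dx.loc[dx["ASSIGN 2"] > 90, "ASSIGN 2"] = 10
print(dx)
```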

cleaning wrong format

It may be difficult or even impossible to analyze data when some cells are in the wrong format. A common case is a column that mixes several datatypes. To fix this, one can convert the entire column into one datatype or remove the offending rows from the dataset.

removing entire row

dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
# dropna() returns a new DataFrame, so assign the result back to keep the change
dx=dx.dropna(subset=['row 3'])
print(dx)
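For the other option, converting a column to one datatype, a possible sketch uses pd.to_numeric with errors="coerce", which turns values that cannot be parsed into NaN so they can then be dropped. The sample data, including the "absent" entry, is made up for illustration.

```python
import pandas as pd

# Hypothetical column that mixes numeric strings with free text.
dx = pd.DataFrame({"fname": ["Ann", "Ben", "Cal"],
                   "ASSIGN 2": ["85", "absent", "70"]})

# Convert to numbers; unparseable values become NaN.
dx["ASSIGN 2"] = pd.to_numeric(dx["ASSIGN 2"], errors="coerce")

# Remove the rows that could not be converted.
dx = dx.dropna(subset=["ASSIGN 2"])
print(dx)
```

For date columns, pd.to_datetime accepts the same errors="coerce" argument.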

