
mary kariuki


Analyzing and cleaning data using pandas

Analyzing data using pandas

It is advisable to get a quick overview of your dataset as soon as you load it into a DataFrame. A dataset is often stored in CSV format, so to load it into a DataFrame we use the following:

import pandas as pd
dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
print(dx)
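If you do not have the CSV file at hand, the same call can be tried on in-memory text. This is only a small sketch: the sample rows and the names `csv_text` and `df` are made up for illustration and are not from the dataset used in this post.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = """fname,ASSIGN 1,ASSIGN 2
Ann,78,85
Ben,64,92
Cate,90,71
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names taken from the header row
```

Checking the shape and the column names right after loading is a quick way to confirm the file was parsed the way you expected.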

We can view our data in various ways, including:

  • head() function

It is a function that returns the column headers and the specified number of rows from the top. If no number is given, it returns the first 5 rows.

import pandas as pd
dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\datas.csv")
print(dt.head(10))
  • tail() function

It returns the column headers and the specified number of rows from the bottom. If no number is given, it returns the last 5 rows.

import pandas as pd
dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
print(dt.tail())
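To see which rows head() and tail() pick, here is a small sketch on a made-up DataFrame with 20 numbered rows (the column names `student_id` and `score` are hypothetical):

```python
import pandas as pd

# Hypothetical data: 20 rows with ids 0 to 19.
df = pd.DataFrame({"student_id": range(20), "score": range(100, 120)})

first_three = df.head(3)   # rows with ids 0, 1, 2
last_default = df.tail()   # no argument: the last 5 rows (ids 15 to 19)

print(first_three)
print(last_default)
```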
  • info() function

It is a function that gives more information about your dataset: the datatype of each column, the number of non-null cells, and the memory usage, among others.

dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
dt.info()  # info() prints its report itself and returns None, so print() is not needed
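The non-null counts from info() can also be read as missing-value counts per column. A minimal sketch, assuming a made-up DataFrame with one missing name and one missing score:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({"fname": ["Ann", "Ben", None], "score": [78.0, np.nan, 90.0]})

df.info()  # prints column names, non-null counts, dtypes and memory usage

# isna() marks missing cells; summing counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
print(df.dtypes)
```

Columns with missing cells here are the ones the cleaning steps later in this post would deal with.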
  • describe() function

The function gives summary statistics for the numeric columns of your data: count, mean, standard deviation, minimum, quartiles (the 50% row is the median) and maximum.

dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
print(dt.describe())
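The result of describe() is itself a DataFrame, so single statistics can be picked out by row label. A small sketch on made-up scores:

```python
import pandas as pd

# Hypothetical scores.
df = pd.DataFrame({"score": [10, 20, 30, 40, 50]})

summary = df.describe()
print(summary)

# The median appears in the row labelled "50%".
print(summary.loc["50%", "score"])
print(df["score"].median())
```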

Cleaning data using pandas

Cleaning data simply means fixing or removing bad data in a dataset. This may involve removing empty cells, removing duplicates, and correcting data that is wrong or in the wrong format.

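  • removing duplicates

Duplicates are rows that have been registered more than once.
To find duplicates we use the duplicated() function, which returns a boolean value for every row: True if the row repeats an earlier row, otherwise False. Summing these values counts the duplicate rows, and drop_duplicates() removes them.

print(dt.duplicated().sum())  # number of duplicate rows
dt = dt.drop_duplicates()     # remove them, keeping the first occurrence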
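Here is a small sketch of what duplicated() marks, on a made-up DataFrame where one row is repeated. It also shows the `subset` parameter, which compares only the listed columns:

```python
import pandas as pd

# Hypothetical data: the third row repeats the first one.
df = pd.DataFrame({
    "fname": ["Ann", "Ben", "Ann", "Cate"],
    "score": [78, 64, 78, 90],
})

dup_mask = df.duplicated()     # True only for the repeated row
n_dupes = int(dup_mask.sum())  # True counts as 1

# drop_duplicates() returns a new DataFrame without the repeated rows.
deduped = df.drop_duplicates().reset_index(drop=True)

# Compare on one column only: rows sharing the same fname count as duplicates.
by_name = df.drop_duplicates(subset=["fname"])

print(n_dupes)
print(deduped)
```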
  • removing empty cells

We remove rows that contain empty cells using the dropna() function. By default this method returns a new DataFrame and does not change the original one.

dt = pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\datas.csv")
data1=dt.dropna()
print(data1)
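To confirm that the original is left untouched, here is a minimal sketch on a made-up DataFrame with one empty cell:

```python
import numpy as np
import pandas as pd

# Hypothetical data: Ben has no score.
df = pd.DataFrame({"fname": ["Ann", "Ben", "Cate"], "score": [78.0, np.nan, 90.0]})

cleaned = df.dropna()  # new DataFrame without the row that has an empty cell

print(len(df))       # the original still has every row
print(len(cleaned))  # the cleaned copy has one row fewer
```

Keeping the original around like this makes it easy to compare before and after, or to try a different cleaning approach later.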
  • cleaning wrong data

We can fix wrong data in two ways: by removing the rows that hold the wrong values or by replacing those values.

  1. removing wrong data
dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
dx.dropna(subset=['fname'],inplace=True)
print(dx)

  2. replacing wrong data

dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
for line in dx.index:
    if dx.loc[line,'ASSIGN 2']>90:
        dx.loc[line,'ASSIGN 2']=10
print(dx)
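The same replacement can also be written without a loop by using a boolean mask, which is usually shorter and faster on large datasets. This is a sketch on made-up scores. The column name 'ASSIGN 2' and the values 90 and 10 are taken from the loop above, and `looped` and `masked` are hypothetical names:

```python
import pandas as pd

# Hypothetical scores; two of them are above 90.
df = pd.DataFrame({"fname": ["Ann", "Ben", "Cate"], "ASSIGN 2": [85, 120, 95]})

# Loop version, as in the example above.
looped = df.copy()
for row in looped.index:
    if looped.loc[row, "ASSIGN 2"] > 90:
        looped.loc[row, "ASSIGN 2"] = 10

# Mask version: select every row where the condition holds and assign once.
masked = df.copy()
masked.loc[masked["ASSIGN 2"] > 90, "ASSIGN 2"] = 10

print(masked)
```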
  • cleaning wrong format

It may be difficult, or even impossible, to analyze data when some cells are in the wrong format. A wrong format can be a column holding values of more than one datatype. To fix this, one can convert the entire column into one datatype or remove the affected rows from the dataset.
removing the entire row

dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
dx=dx.dropna(subset=['row 3'])  # dropna() returns a new DataFrame, so keep the result
print(dx)
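Note that dropna() only removes cells that are actually empty. One possible way to handle a value in the wrong format is to convert the column first, so that anything that cannot be converted becomes empty, and then drop those rows. This is a sketch on made-up data, where the `score` column and its values are hypothetical:

```python
import pandas as pd

# Hypothetical data: the score column was read as text and "abc" is not a number.
df = pd.DataFrame({"fname": ["Ann", "Ben", "Cate"], "score": ["78", "abc", "90"]})

# Convert the whole column to numbers; values that cannot be converted become NaN.
df["score"] = pd.to_numeric(df["score"], errors="coerce")

# Now the badly formatted cell is empty, so dropna() can remove its row.
df = df.dropna(subset=["score"])

print(df)
print(df["score"].dtype)
```

For date columns, pd.to_datetime() with errors="coerce" works in the same way.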

Top comments (3)

Brenda Aluoch

This is good

mary kariuki

thanks brenda

HK MBURU

nice