Analyzing data using pandas
Once you load your data into a DataFrame, it is advisable to first get a quick overview of the dataset. A dataset is often stored in CSV format, so to load it into a DataFrame we use the following:
import pandas as pd
dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
print(dx)
We can view our data in various ways, including:
- head() function
It returns the column headers and the specified number of rows from the top (5 rows if no number is given).
import pandas as pd
dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\datas.csv")
print(dt.head(10))
- tail() function
It returns the column headers and the specified number of rows from the bottom (5 rows if no number is given).
import pandas as pd
dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
print(dt.tail())
- info() function
It gives more information about your dataset, such as the data type of each column, the number of non-null cells, and memory usage.
dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
dt.info()  # info() prints the summary itself and returns None, so no print() is needed
- describe() function
The function gives a statistical description of your data: for numeric columns it shows the count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum.
dt=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
print(dt.describe())
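To try the four functions above without a file on disk, here is a minimal sketch. The CSV text and the names in it are made up for illustration; only the column names fname and ASSIGN 2 echo the ones used later in this post.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = """fname,ASSIGN 1,ASSIGN 2
Ann,70,85
Ben,55,60
Cy,90,75
Dee,65,80
"""

df = pd.read_csv(io.StringIO(csv_text))

first_two = df.head(2)   # first 2 rows
last_two = df.tail(2)    # last 2 rows
summary = df.describe()  # count, mean, std, min, quartiles, max (numeric columns)

print(first_two)
print(last_two)
df.info()                # prints dtypes, non-null counts and memory usage
print(summary.loc["mean"])
```

Reading from io.StringIO is only a convenience here; with a real file you would pass the path to read_csv() as in the examples above.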
Cleaning data using pandas
Cleaning data simply means fixing or removing bad data in a dataset. This may involve removing empty cells, removing duplicates, and correcting data that is wrong or in the wrong format.
- removing of duplicates
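Duplicates are rows that have been registered more than once.
To find duplicates we use the duplicated() function, which returns a Boolean value for every row: True if the row repeats an earlier row, otherwise False. Summing the result counts the duplicate rows. To actually remove them, use drop_duplicates().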
dt.duplicated().sum()
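Counting duplicates does not remove them, so here is a small sketch of the removal step with drop_duplicates(). The DataFrame is made-up sample data, not the RECORD2.csv file used in this post.

```python
import pandas as pd

# Hypothetical data: the second "Ben" row repeats the first one exactly.
df = pd.DataFrame({
    "fname": ["Ann", "Ben", "Ben", "Cy"],
    "score": [70, 55, 55, 90],
})

duplicate_count = df.duplicated().sum()  # rows that repeat an earlier row
deduped = df.drop_duplicates()           # new DataFrame; df itself is unchanged

print(duplicate_count)
print(deduped)
```

By default drop_duplicates() keeps the first occurrence of each repeated row and returns a new DataFrame.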
- removing empty cells
We remove rows that contain empty cells using the dropna() function. By default this method returns a new DataFrame and does not change the original one.
dt = pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\datas.csv")
data1=dt.dropna()
print(data1)
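It can help to count the empty cells before dropping anything. The sketch below uses made-up sample data to show that, and to confirm that the original DataFrame keeps all its rows after dropna().

```python
import pandas as pd

# Hypothetical data with two empty cells (None becomes NaN/missing).
df = pd.DataFrame({
    "fname": ["Ann", None, "Cy"],
    "score": [70.0, 55.0, None],
})

empty_cells = df.isna().sum().sum()  # total number of empty cells
cleaned = df.dropna()                # keeps only rows with no empty cells

print(empty_cells)
print(len(df), len(cleaned))
```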
- cleaning wrong data
We fix wrong data in two ways: by removing the wrong values or by replacing them.
1. removing wrong data
dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
dx.dropna(subset=['fname'],inplace=True)
print(dx)
2. replacing wrong data
dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
# go through every row and replace any 'ASSIGN 2' value above 90 with 10
for line in dx.index:
    if dx.loc[line,'ASSIGN 2']>90:
        dx.loc[line,'ASSIGN 2']=10
print(dx)
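The same replacement can also be written without a loop by using a Boolean mask. This is only an alternative sketch on made-up sample data; the threshold 90 and the replacement value 10 are taken from the loop above.

```python
import pandas as pd

# Hypothetical marks; two of them are above the threshold of 90.
df = pd.DataFrame({
    "fname": ["Ann", "Ben", "Cy"],
    "ASSIGN 2": [85, 120, 95],
})

too_high = df["ASSIGN 2"] > 90     # True for each row with a value above 90
df.loc[too_high, "ASSIGN 2"] = 10  # replace only those values

print(df)
```

On large datasets the mask version is usually faster than looping over the index, but both give the same result.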
- cleaning wrong format
It may be difficult or even impossible to analyze data when some cells have the wrong format. A wrong format can be a column that holds values of more than one datatype. To fix this, one can convert the entire column into one datatype or remove the affected rows from the dataset.
removing the entire row
dx=pd.read_csv(r"C:\Users\ADMIN\Desktop\EXCEL\RECORD2.csv")
# subset takes column labels; inplace=True is needed so that dx itself is changed
dx.dropna(subset=['row 3'],inplace=True)
print(dx)
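The other option, converting to one datatype, could look like the sketch below. It assumes the column is meant to be numeric and uses pd.to_numeric(); the DataFrame and the "absent" entry are made up for illustration.

```python
import pandas as pd

# Hypothetical column that mixes numbers stored as text with a stray word.
df = pd.DataFrame({
    "fname": ["Ann", "Ben", "Cy"],
    "score": ["70", "absent", "90"],
})

# Convert to numbers; anything that cannot be converted becomes NaN (empty).
df["score"] = pd.to_numeric(df["score"], errors="coerce")

# The values that failed to convert are now empty cells, so dropna() removes them.
df = df.dropna(subset=["score"])

print(df)
print(df["score"].dtype)
```

With errors="coerce" nothing is raised for bad values, so it is worth checking how many cells became empty before dropping them.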