Hi all, in this article we will discuss the first step in building a model: data preprocessing.
Importing Libraries - Here we import the most commonly used libraries, pandas and numpy. Other libraries may be needed in other projects, but for this example these two are enough.
import numpy as np
import pandas as pd
Importing Datasets - For this example we use a dataset that records whether or not a purchase was made by a person with particular attributes.
In this data, the first 3 columns are the features and the last column is the dependent variable. We usually split the data along these lines, with X as the inputs and y as the output. The file is loaded with pandas and then split using the iloc operation on the dataset.
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
So now we have our data separated and ready for further handling.
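To make the split concrete, here is a minimal sketch that runs the same two iloc lines on a tiny in-memory CSV. The column names and values are invented for illustration and are only assumed to resemble Data.csv (one text column, two numeric columns, one label column).

```python
import io

import pandas as pd

# Hypothetical stand-in for Data.csv; the column names and values are made up.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,,No
Spain,,61000,Yes
"""

dataset = pd.read_csv(io.StringIO(csv_text))

X = dataset.iloc[:, :-1].values  # all rows, every column except the last
y = dataset.iloc[:, -1].values   # all rows, only the last column

print(X.shape)  # 4 rows, 3 feature columns
print(y)        # the label of each row
```

Note that empty cells in the CSV are read by pandas as NaN, which is what the imputation step later looks for.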
Handling missing data - If you look at the data, there are 2 missing values (marked in yellow).
scikit-learn (sklearn) is one of the most widely used libraries for machine learning in Python, so we will use it here as well. The strategy is to replace each missing value with the mean of its column. For this we use SimpleImputer from sklearn.impute.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
After this step, the missing values in columns 1 and 2 are replaced with the mean of their respective columns.
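As a small check of what the imputer does, the sketch below applies it to a hypothetical 4-row feature matrix (the values are invented, not taken from Data.csv) and compares the filled cells with the column means computed by hand. It uses fit_transform, which combines the separate fit and transform calls above into one step.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix: one text column followed by two numeric columns.
X = np.array(
    [
        ["France", 44.0, 72000.0],
        ["Spain", 27.0, 48000.0],
        ["Germany", 30.0, np.nan],  # missing value in column 2
        ["Spain", np.nan, 61000.0],  # missing value in column 1
    ],
    dtype=object,
)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# Only the numeric columns (1 and 2) are passed to the imputer.
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

# Column means computed by hand from the non-missing entries.
mean_col1 = (44.0 + 27.0 + 30.0) / 3
mean_col2 = (72000.0 + 48000.0 + 61000.0) / 3

print(X[3, 1], mean_col1)  # filled value and expected mean for column 1
print(X[2, 2], mean_col2)  # filled value and expected mean for column 2
```

The text column is excluded from the slice because a mean cannot be computed for strings.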
One more thing to take care of is the first column, which contains text/string values. Most models cannot work with strings directly, so we need to encode this column as numbers before feeding it to a model.
For this we use ColumnTransformer from sklearn.compose and OneHotEncoder from sklearn.preprocessing.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
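Here, ColumnTransformer takes a transformers argument, which is a list of tuples. Each tuple holds a name for the transformer, the transformer itself, and the columns to transform. The remainder argument specifies what to do with the other columns; here it is 'passthrough', which means the other columns are left unchanged.
In the output, the first column is replaced by one binary column per category, followed by the remaining columns unchanged.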
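To illustrate the shape of the result, here is a minimal sketch on a hypothetical feature matrix with three distinct values in the first column (the data is invented for illustration). With three categories, the single text column becomes three binary columns, so the matrix grows from 3 to 5 columns.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature matrix with no missing values left.
X = np.array(
    [
        ["France", 44.0, 72000.0],
        ["Spain", 27.0, 48000.0],
        ["Germany", 30.0, 54000.0],
        ["Spain", 38.0, 61000.0],
    ],
    dtype=object,
)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],  # one-hot encode column 0
    remainder="passthrough",  # keep the other columns as they are
)
X_encoded = np.array(ct.fit_transform(X))

# Categories are sorted, so the binary columns follow this order.
categories = ct.named_transformers_["encoder"].categories_[0]

print(categories)
print(X_encoded.shape)
print(X_encoded[0])  # first row: binary columns first, then the passthrough values
```

The encoded columns come first in the output because ColumnTransformer places the transformed columns before the passthrough columns.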
Hope it was helpful. In the next part of preprocessing we will see how to encode labels, apply feature scaling, and split the data into training and test sets.