Machine Learning - Data Preprocessing - 1

Hi all, in this article we will discuss the first step in building our model: data preprocessing.

Importing Libraries - Here we import the most commonly used libraries, such as pandas and numpy. Other libraries may be needed too, but for this example let's keep it to these two.

import numpy as np
import pandas as pd

Importing Datasets - For this example we have a dataset, shown below, that records whether or not a purchase was made by a person with particular characteristics.

In this data we can see that the first 3 columns are features and the last column is the dependent variable. We usually split the data along these lines, with X as the inputs and y as the output. The data is loaded with pandas and then split using the iloc indexer on the dataset.

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

So now we have our data separated and ready for further handling.
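To make the iloc split concrete, here is a minimal sketch on a tiny in-memory table. The column names and values are made up for illustration and are not taken from Data.csv; only the two iloc lines mirror the article's code.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv; column names and values are made up.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# All rows, every column except the last one -> features
X = dataset.iloc[:, :-1].values
# All rows, only the last column -> dependent variable
y = dataset.iloc[:, -1].values

print(X.shape)  # (3, 3)
print(y.shape)  # (3,)
```

The `:-1` slice keeps everything up to, but not including, the last column, while `-1` picks the last column on its own, so the same two lines work no matter how many feature columns the file has.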

Handling missing data - If you look at the data, there are 2 missing values (marked in yellow).
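When the gaps are not highlighted by hand, one possible way to find them is to let pandas count them. The sketch below assumes a small made-up table with one missing value in each numeric column; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries, one per numeric column.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
    "Purchased": ["No", "Yes", "No", "No"],
})

# isna() marks missing cells as True; sum() then counts them per column.
missing_per_column = dataset.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)  # 2
```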

One of the most widely used libraries for machine learning in Python is sklearn (scikit-learn), so we will use it here as well. The strategy is to replace each missing value with the mean of its column, and for this we use SimpleImputer from sklearn.impute.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

and it will give us
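As a separate, smaller illustration (made-up values, not the article's dataset), the sketch below shows mean imputation on two numeric columns. The explicit cast to float is an addition of this sketch so that the imputer receives a purely numeric array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature matrix: column 0 is text, columns 1 and 2 are numeric.
X = np.array([
    ["France", 40.0, 70000.0],
    ["Spain", np.nan, 50000.0],
    ["Germany", 30.0, np.nan],
    ["Spain", 50.0, 60000.0],
], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Fit on the numeric columns only, then write the filled values back.
numeric = X[:, 1:3].astype(float)
X[:, 1:3] = imputer.fit_transform(numeric)

print(imputer.statistics_)  # column means learned by fit: [40., 60000.]
print(X[1, 1], X[2, 2])     # the two gaps are now 40.0 and 60000.0
```

Here fit_transform combines the separate fit and transform calls used above; the learned column means are available afterwards in `imputer.statistics_`.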

One more thing to take care of here is the first column, which holds text/string values. Models might not be able to interpret these correctly, so we need to encode them as numbers before feeding them to the model.
For this we use ColumnTransformer from sklearn.compose and OneHotEncoder from sklearn.preprocessing.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

Here ColumnTransformer takes transformers, a list of tuples, each holding the name of the transformer, the transformer itself, and the columns to transform. The remainder argument specifies what to do with the other columns; here it is passthrough, which means the other columns are left unchanged.
and our output will be
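As a separate, smaller illustration (made-up values, not the article's dataset), the sketch below applies the same ColumnTransformer setup to three rows so the effect of one-hot encoding and passthrough is easy to follow.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up feature matrix: column 0 is the text column to encode.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",  # keep the remaining columns unchanged
)
X_encoded = np.array(ct.fit_transform(X))

# One text column became three 0/1 columns; the two numeric columns follow.
print(X_encoded.shape)  # (3, 5)
print(X_encoded)
```

The encoded columns are placed first, one per category in sorted order (France, Germany, Spain in this sketch), and the passthrough columns are appended after them.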

Hope it was helpful. In the next part of preprocessing we will see how to encode labels, apply feature scaling, and split the data into training and test sets.

DEV Community
