In the previous articles, we saw a simple example of how machine learning helps us predict a target. There are many algorithms and ways to train a model, but all of them need data. Whenever we take data from the real world, it always contains some irregularities. Data preprocessing is the first and foremost step after acquiring data: it puts the data into the desired format so that a model can be trained on it.
The whole process is known as EDA (Exploratory Data Analysis). EDA also includes visualizing data using various visualization techniques, which we will cover in the next part.
There is a great article on data preprocessing; see it here.
- When we talk about data in machine learning, we usually think of data in the form of a table consisting of rows and columns. Columns depict features (attributes), and rows depict individual observations (records).
- There are other types of data, like JSON files, which contain information in the form of embedded documents and fields.
- There is one more type of data that is completely unstructured, unlike the above two: images, videos, and other such files.
Here we are only discussing the first type of data and its preprocessing.
Business knowledge is the most important thing while preprocessing data. It is often underestimated, but it matters most of the time.
Let's consider a dataset containing a feature country. Some entries are written as UK while others are written as United Kingdom. Both refer to the same country, but if we don't have business knowledge we cannot figure this out, and the model will eventually be trained on inconsistent data.
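As a quick sketch of how such a fix might look in pandas (the column name and the alias map here are hypothetical, standing in for real business knowledge):

```python
import pandas as pd

# toy dataset with inconsistent spellings of the same country
df = pd.DataFrame({'country': ['UK', 'United Kingdom', 'France', 'UK']})

# map known aliases to one canonical name (this mapping comes
# from business knowledge, not from the data itself)
aliases = {'UK': 'United Kingdom', 'U.K.': 'United Kingdom'}
df['country'] = df['country'].replace(aliases)

print(df['country'].unique())  # ['United Kingdom' 'France']
```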
Data exploration means exploring various aspects of data. The following are some ways to do so in Python:
When we import a dataset in Python using pandas, it is stored in an object called a DataFrame. The DataFrame object has various built-in methods, among which the describe() function gives detailed statistics of the numerical features of the dataset.
import pandas
df = pandas.read_csv('path/to/csv')
df.describe()
Pandas profiling is a much better way than the above because it extracts almost all the information required to better explore the data. pandas-profiling is a separate library from pandas.
To install or upgrade it, use
pip install --upgrade pandas-profiling
Then use it as follows:
profile = df.profile_report(title="<name>", explorative=True, minimal=False)
profile.to_file(output_file="<filename>.html")
Pandas profiling generates an HTML report that gives you various insights into each feature.
It is very common to have missing values in datasets. Mostly they occur because people do not like to fill in every detail of a form. Sometimes they are also caused by machine errors, such as the irregular functioning of a sensor or other device collecting the data.
Be aware that missing values can often appear in forms other than an empty cell in a table.
When some fields are marked as required and people don't want to fill them, they often enter placeholder strings like NIL. Devices may also record default values in place of null, such as ? or -1. Look at the frequencies of such values: if they occur suspiciously often in your dataset, you can consider them as NaN (Not a Number). You can replace these values with NaN, or interpret them as NaN directly while loading the dataset for the first time with pandas.
df = pandas.read_csv('data.csv', na_values = ['?','-1'])
na_values replaces all values that match the given list with NaN.
There are mainly two ways to handle such values.
- Remove the observations (rows) containing them.
# select only those observations for a given column that
# don't have a null or NaN value
df = df[df['column'].notna()]
This option should be used when we have a very large dataset and the number of rows removed will not affect it much.
- Remove the whole column.
df.drop(columns = ['col1','col2'], inplace = True)
This option should be used only when the proportion of missing values in a column is very high (e.g., > 50%).
- Impute missing values with the central tendency of the data.
A central tendency is a central or typical value of a probability distribution; it represents the location around which most of the data lies.
For numerical features, there are two common measures of central tendency: the mean and the median.
The mean should be chosen only when the data is distributed evenly (not skewed). Suppose most of your values are like 100, 150, 200, 250 and only a few are like 800; the mean of these values is 300, but the median is 200. In this case, the metric that better approximates the central tendency is the median.
So whenever there is a large difference between the mean and the median, we should choose the median.
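The numbers above can be checked directly; this small sketch uses Python's statistics module:

```python
import statistics

# most values are moderate, one is an outlier
values = [100, 150, 200, 250, 800]

mean = statistics.mean(values)      # (100+150+200+250+800) / 5 = 300
median = statistics.median(values)  # middle of the sorted values = 200

# the large gap between mean and median signals a skewed
# distribution, so the median is the safer imputation value here
print(mean, median)  # 300 200
```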
df['col1'].fillna(value=df['col1'].mean(), inplace=True)
df['col2'].fillna(value=df['col2'].median(), inplace=True)
For categorical features, the central tendency is the most frequent category, i.e., the mode.
# mode() returns a Series, so take its first value
df['col3'].fillna(value=df['col3'].mode()[0], inplace=True)
An outlier is a data point that differs significantly from other observations.
We will cover plotting such distributions in the next part.
Outliers can decrease the performance of a model, so they should be removed or treated.
IQR (Inter-Quartile Range): the inter-quartile range is the distance between the third quartile (Q3) and the first quartile (Q1): IQR = Q3 - Q1. Q3 is the median of the upper half of the data, and Q1 is the median of the lower half. Values beyond Q1 - 1.5 * IQR or Q3 + 1.5 * IQR are treated as outliers and removed.
q1 = df['col'].quantile(0.25)   # lower quartile
q2 = df['col'].quantile(0.50)   # median
q3 = df['col'].quantile(0.75)   # upper quartile
iqr = q3 - q1
low = q1 - 1.5 * iqr
high = q3 + 1.5 * iqr

# remove outliers
df = df[(df['col'] >= low) & (df['col'] <= high)]
# or replace them with the median
df.loc[(df['col'] > high) | (df['col'] < low), 'col'] = q2
Sometimes, a single feature or attribute contains multiple values.
Normalization is a concept from DBMS: a database is said to be in first normal form if it contains only atomic values (values that cannot be divided further). To solve such problems, you can refer to Stack Overflow.
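As a hypothetical sketch, a non-atomic column holding "city, country" in a single value can be split into atomic columns with pandas (the column names here are made up):

```python
import pandas as pd

# hypothetical non-atomic column: two values packed into one field
df = pd.DataFrame({'location': ['London, United Kingdom', 'Paris, France']})

# split into two atomic columns and drop the original
df[['city', 'country']] = df['location'].str.split(', ', expand=True)
df = df.drop(columns=['location'])

print(df)
```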
Another common problem is dealing with dates. A date is usually represented as a string; we can use pandas to extract the year, month, day, week, etc. from it.
df['date'] = pandas.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
# dt.week is deprecated in newer pandas versions
df['week'] = df['date'].dt.isocalendar().week
Seasonality is a characteristic of a time series in which the data experiences regular and predictable changes that recur every calendar year. Any predictable fluctuation or pattern that repeats over a one-year period is said to be seasonal. The easiest example is rainfall.
Any feature of a dataset that shows seasonality over time can cause instability in the model because it never shows a clear relationship with the output. So, unless the target is itself a seasonal value (models exist for such predictions), we should remove seasonality from such data.
A common way to deal with seasonal data is differencing. If the season lasts a week, we can remove it from today's observation by subtracting the value from the same day last week.
Similarly, if a season cycles over a year, like rainfall, we can subtract the value from the same day last year to correct for seasonality.
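A minimal sketch of weekly differencing with pandas (the data here is synthetic; a real series would use its actual season length):

```python
import numpy as np
import pandas as pd

# synthetic series: a repeating 7-day pattern plus a steady upward trend
pattern = np.tile([0, 1, 3, 2, 1, 0, 5], 4)  # 4 weeks of the same pattern
trend = np.arange(28)                        # grows by 1 per day
s = pd.Series(pattern + trend)

# subtract the value from the same day last week
deseasonalized = s.diff(7)

# the 7-day pattern cancels out; only the weekly trend increment remains
print(deseasonalized.dropna().unique())  # [7.]
```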
(figure: the series after removing seasonality; image source: machinelearningmastery.com)
Bivariate analysis is a quantitative analysis that involves two variables, to determine the empirical relationship between them.
This can be done by drawing plots between two variables and checking their distributions.
Correlation (or dependence) is a statistical relationship between two variables. Correlation refers to the degree to which a pair of variables are linearly related: a positive correlation shows that the variables are directly proportional, while a negative one shows that they are inversely proportional. The value of the correlation coefficient always lies between -1 and 1.
Correlation measures only the association between two variables; it doesn't tell us about causation, i.e., large values of y are not necessarily caused by large values of x. When we have highly correlated features in the dataset, the variance also becomes high, which causes instability in the model: the model becomes sensitive to these features, and slight changes in them affect the whole model.
So it is better to drop one of the two features when they show a high correlation.
import matplotlib.pyplot as pyplot
import seaborn

pyplot.figure(figsize=(10, 10))
seaborn.heatmap(df.corr(), annot=True, fmt='.2f', square=True,
                vmax=1, vmin=-1, linewidths=0.5, cmap='Dark2')
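Beyond eyeballing the heatmap, dropping correlated features can be automated. This is a sketch on synthetic data; the 0.9 threshold is an arbitrary choice:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    'a': a,
    'b': 2 * a + 0.01 * rng.normal(size=100),  # almost a copy of 'a'
    'c': rng.normal(size=100),                 # independent feature
})

# upper triangle of the absolute correlation matrix
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# drop one feature from every pair correlated above the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)

print(to_drop)  # ['b']
```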
Before moving further we have to recognize what categorical data is. Many times we assume that a variable is categorical only if it is stored as objects (strings).
Suppose a dataset contains a feature year that holds only 3 distinct years - 2008, 2009, 2010. Are these values categorical or numerical?
They are categorical values, but their numerical representation is a problem for the model. Also, models can only work on numerical data, which means we cannot use categorical features containing strings directly. To use such features, we have to transform them into numerical form.
For example, we map our categorical data to simple counting numbers like 0, 1, 2, 3, ...
This is called label encoding.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['year'] = le.fit_transform(df['year'])
What if the categories are ordinal? Then we have to encode them in order. For example, if a dataset contains grades from A to D that correlate with marks (A means good marks, D means bad marks), then this is an ordinal variable and we have to encode it in the same order it follows: A = 0, B = 1, C = 2, D = 3.
# LabelEncoder assigns labels in sorted (alphabetical) order,
# which here happens to match the desired grade order
le.fit(["grade_A", "grade_B", "grade_C", "grade_D"])
print(list(le.classes_))
df['grades'] = le.transform(df['grades'])
One hot encoding allows the representation of categorical data to be more expressive. Many machine learning algorithms cannot work with categorical data directly, and one hot encoding may offer a more nuanced representation than a single label.
In one hot encoding, the label encoded variable is removed and a new binary variable is added for each unique label value.
Visuals are better than explaining.
Again, one hot encoding introduces perfectly correlated columns (the dummy variable trap: any one dummy can be derived from the others), so we should drop one of the generated columns.
Scikit learn provides this inbuilt functionality.
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2D input, hence the double brackets
enc = OneHotEncoder(drop='first')
df_enc = enc.fit_transform(df[['grades']])
Pandas also provides one hot encoding via its get_dummies() function.
df_enc = pandas.get_dummies(data = df, columns = ['year', 'grades'], drop_first = True)
Scaling means fitting the data values under a common scale (range).
Suppose we have a dataset with two features on completely different scales: one lies in the range 1 to 30, while the other lies in the range 4000 to 100000. Some algorithms, like k-nearest neighbors, classify data points based on the distances between their feature vectors. In such algorithms, the feature with the small range will barely affect the distance, so having it is almost meaningless. Almost every ML algorithm deals with such geometric distances, except the decision-tree family of algorithms.
You can refer to this article for more info about algorithms.
So there is a need to bring them on a common scale.
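A small numeric sketch of the distance argument, using the hypothetical ranges above (1 to 30 and 4000 to 100000):

```python
import numpy as np

# two points; the second feature is on a much larger scale
p = np.array([10.0, 50000.0])
q = np.array([30.0, 52000.0])

# unscaled distance is dominated entirely by the second feature
d_raw = np.linalg.norm(p - q)  # ~2000.1

# min-max scale each feature to [0, 1] using the assumed ranges
mins = np.array([1.0, 4000.0])
maxs = np.array([30.0, 100000.0])
d_scaled = np.linalg.norm((p - mins) / (maxs - mins)
                          - (q - mins) / (maxs - mins))  # ~0.69

# after scaling, both features contribute on comparable terms
print(round(d_raw, 1), round(d_scaled, 2))
```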
There are mainly two scaling techniques: standardization (StandardScaler, which rescales to zero mean and unit variance) and min-max normalization (MinMaxScaler, which rescales to the range [0, 1]).
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['cost','expenses']] = scaler.fit_transform(df[['cost','expenses']])
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['cost','expenses']] = scaler.fit_transform(df[['cost','expenses']])
Around 2018, it was reported that Amazon had built an AI recruiting tool that reviewed job applicants' resumes and rated them, similar to Amazon's shopping ratings. The model was trained on previous applicants' data, most of which came from men. It became biased towards men and taught itself that male candidates were preferred. The candidate's gender was not explicitly given to the model, but it inferred it from the resume by spotting words like "women's" (as in "women's chess club champion") and rated such resumes low.
This all happened because the training data was imbalanced. There are two ways to handle imbalanced data.
Undersampling: reduce the number of samples of the class that has more of them. This method is used when we have a very large dataset and removing instances doesn't cause much loss of information.
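A pandas-only sketch of random undersampling (imbalanced-learn's RandomUnderSampler does the same job; the toy data here is made up):

```python
import pandas as pd

# imbalanced toy data: 8 samples of class 0, only 2 of class 1
df = pd.DataFrame({'x': range(10), 'y': [0] * 8 + [1] * 2})

# randomly downsample the majority class to the minority class size
minority = df[df['y'] == 1]
majority = df[df['y'] == 0].sample(n=len(minority), random_state=42)
balanced = pd.concat([majority, minority])

print(sorted(balanced['y'].value_counts().to_dict().items()))  # [(0, 2), (1, 2)]
```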
Oversampling: increase the number of samples of the class that has fewer of them. This is also known as data augmentation.
There are again two major algorithms for oversampling.
SMOTE works by selecting examples that are close in the feature space, drawing a line between them, and generating a new synthetic sample at a point along that line.
# x = all features except the target, y = imbalanced target
from imblearn.over_sampling import SMOTE

smote = SMOTE()
os_x, os_y = smote.fit_resample(x, y)
ADASYN is a generalized form of the SMOTE algorithm. The only difference is that ADASYN considers the density distribution, which decides the number of synthetic instances generated for each sample. This helps it focus on samples that are difficult to learn, and thus adaptively shift the decision boundary towards them.
from imblearn.over_sampling import ADASYN

adasyn = ADASYN(random_state=99)
os_x, os_y = adasyn.fit_resample(x, y)
We will see different visualization techniques and plots using matplotlib and seaborn in the next article.