Standardizing Data

#beginners #python #machinelearning #datascience

#### What is standardization?

It is a preprocessing method used to transform continuous data to make it look normally distributed.

Why?

Scikit-learn models assume normally distributed data. In the case of continuous data, we might risk biasing the models.

Two methods can be used for the standardization process:

Log Normalization
Feature Scaling

These methods are applied to continuous numerical data.

When?

Models are present in linear space. (Ex. KNN, KMeans, etc.),data must also be in linear space.
Dataset features that have high variance. This could bias a model that assumes it is normally distributed.
Modeling dataset that has features that are continuous and on different scales. > For example, a dataset that has height and weight as its features needs to be standardized to make sure they are on the same linear scale.

What is Log Normalization?

Log transformation is applied
Used in datasets where the variance of a particular column is significantly high as compared to other columns
Natural log is applied on values

It is used to captured relative changes, and magnitude of change, and keeps everything in the positive space.

Let's see the implementation.

print(df)

print(df.var())

We will use the log operator from the NumPy library to perform the normalization.

import numpy as np
df["log_2"] = np.log(df["col2"])
print(df)

Let's see values.

print(np.var(df[["col1","log_2"]]))

What is feature scaling?

This method is useful when

continuous features are present on different scales.
model is in linear scale.

The transformation on the dataset is done such that the resultant mean is 0 and the variance is 1.

Here across the features, you can see how the variation is.

print(df)

print(df.var())

Using the standardscaler method from sklearn, the process is done.

from sklearn.preprocessing  import StandardScaler
scaler = StandardScaler()
df_scaled= pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_scaled)
print(df.var())