Manav Modi

Posted on • Originally published at manavmodi.hashnode.dev

Standardizing Data

#### What is standardization?

Standardization is a preprocessing method that transforms continuous data so that it looks closer to normally distributed.

Why?

Many scikit-learn models assume that the features are roughly normally distributed and on comparable scales. Feeding them continuous data that violates this assumption risks biasing the model.

Two methods can be used for the standardization process:

  1. Log Normalization
  2. Feature Scaling

These methods are applied to continuous numerical data.

When?

  1. The model operates in a linear space (e.g. KNN, KMeans), so the data must also be in a linear space.
  2. A feature in the dataset has high variance, which could bias a model that assumes the data is normally distributed.
  3. The dataset has continuous features that are on different scales.

> For example, a dataset that has height and weight as its features needs to be standardized to make sure both are on the same linear scale.
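
To see why differing scales matter for distance-based models such as KNN, here is a minimal sketch. The three height/weight rows are made-up values, and `euclidean` is a helper defined only for this illustration.

```python
import numpy as np

# Hypothetical rows: [height in metres, weight in kilograms]
people = np.array([
    [1.60, 70.0],  # a
    [1.90, 70.0],  # b: much taller than a, same weight
    [1.60, 72.0],  # c: same height as a, slightly heavier
])

def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

# On the raw data, weight dominates simply because its numbers are larger
raw_ab = euclidean(people[0], people[1])
raw_ac = euclidean(people[0], people[2])
print(raw_ab, raw_ac)

# Standardize each column: subtract its mean, divide by its standard deviation
scaled = (people - people.mean(axis=0)) / people.std(axis=0)

scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])
print(scaled_ab, scaled_ac)
```

On the raw values, the 30 cm height gap produces a smaller distance than the 2 kg weight gap. After standardizing, both features contribute to the distance on a comparable footing.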

What is Log Normalization?

  1. A log transformation is applied to the column.
  2. It is used when the variance of a particular column is significantly higher than that of the other columns.
  3. The natural log is applied to the values.
  4. It captures relative changes and the magnitude of change, and keeps everything in positive space.

Let's see the implementation.

print(df) 

print(df.var())

We will use the `log` function from the NumPy library, which computes the natural log, to perform the normalization.

import numpy as np
df["log_2"] = np.log(df["col2"])
print(df)

Now let's compare the variances of the two columns.

# Use DataFrame.var() so the result is comparable with the earlier df.var() call
print(df[["col1", "log_2"]].var())

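The steps above can be tried end to end with a small, self-contained sketch. The DataFrame here is made-up data standing in for the `df` used in this post; only the column names `col1`, `col2` and `log_2` follow the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical data: col2 has a far larger spread than col1
df = pd.DataFrame({
    "col1": [1.0, 1.2, 0.9, 1.1, 1.3],
    "col2": [10.0, 200.0, 3500.0, 48000.0, 900000.0],
})

# Variance before the transformation: col2 dwarfs col1
print(df.var())

# np.log is the natural log; it needs strictly positive inputs
df["log_2"] = np.log(df["col2"])

# Variance after the transformation: log_2 is on a far smaller scale than col2
print(df[["col1", "log_2"]].var())
```

If a column can contain zeros, `np.log1p`, which computes log(1 + x), is a common alternative, since the log of zero is undefined.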

What is feature scaling?

This method is useful when

  1. continuous features are on different scales.
  2. the model operates in a linear space.

The dataset is transformed so that each feature ends up with a mean of 0 and a variance of 1.
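
As a rough sketch of what this transformation does, each value has its column mean subtracted and is then divided by the column standard deviation. The array `X` below is made-up data, and the population standard deviation (`ddof=0`, NumPy's default) is assumed.

```python
import numpy as np

# Hypothetical features: [height in metres, weight in kilograms]
X = np.array([
    [1.60, 55.0],
    [1.75, 72.0],
    [1.82, 90.0],
    [1.68, 64.0],
])

# Per-column mean and population standard deviation
mean = X.mean(axis=0)
std = X.std(axis=0)

# z-score: centre each column on 0, then rescale it to unit variance
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.var(axis=0))   # approximately 1 for every column
```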

Here you can see how much the variance differs across the features.

print(df)

print(df.var())

The scaling is done with the `StandardScaler` class from scikit-learn.

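import pandas as pd
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the DataFrame and rescale every column
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_scaled)
# Check the variance of the scaled data, not of the original df
print(df_scaled.var())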

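One detail worth knowing when checking the result: `StandardScaler` divides by the population standard deviation, while pandas' `var()` defaults to the sample variance (`ddof=1`), so the printed variances can come out slightly above 1 on small datasets. A minimal sketch with made-up data (the column names are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset with features on very different scales
df = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.82, 1.68, 1.90],
    "weight_kg": [55.0, 72.0, 90.0, 64.0, 81.0],
    "income": [32000.0, 48000.0, 150000.0, 27000.0, 61000.0],
})

scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(df_scaled.mean())       # approximately 0 for every column
print(df_scaled.var(ddof=0))  # population variance: approximately 1
print(df_scaled.var())        # sample variance: slightly above 1 here
```

When the data is later split into training and test sets, a common practice is to call `fit` on the training set only and reuse the fitted scaler with `transform` on the test set.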

