Importance of Feature Scaling
Machine learning algorithms such as linear regression and neural networks work better or converge faster when the features are on a similar scale, and feature scaling makes the scales of the features similar.
For example, with features like age and income, your model may effectively prioritize income over age simply because of the large difference in the scale of their values.
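To make this concrete, here is a minimal sketch with made-up numbers: two people who differ greatly in age but only slightly in income, where the raw Euclidean distance between them is dominated almost entirely by the income axis.

import numpy as np

# Hypothetical values for illustration: [age, income]
a = np.array([25., 50_000.])
b = np.array([55., 51_000.])

# The 30-year age gap barely registers next to the 1,000-unit income gap
print(np.linalg.norm(a - b))   # ~1000.45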
Standardization (Z-score normalization)
Standardization rescales the features of a dataset so that they have a mean of 0 and a standard deviation (SD) of 1. This feature scaling technique is achieved by subtracting the mean of a feature from each of its values and then dividing by its standard deviation.
The formula for standardization is:

X_new = (X - μ) / σ

where μ is the mean of the feature and σ is its standard deviation.
Standardization is less affected by outliers than normalization, so it is often used when the maximum and minimum values are not fixed or when outliers exist.
from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

# Fit the scaler on the training data, then transform it
scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
print(X_scaled)
# [[ 0.         -1.22474487  1.33630621]
#  [ 1.22474487  0.         -0.26726124]
#  [-1.22474487  1.22474487 -1.06904497]]
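Because the scaler is fitted as a separate step, the statistics learned from the training data can be reused on new data. A minimal sketch, where X_test is a hypothetical held-out sample with the same three columns:

# Per-column statistics learned during fit()
print(scaler.mean_)    # [1.         0.         0.33333333]
print(scaler.scale_)   # [0.81649658 0.81649658 1.24721913]

# New data is scaled with the *training* statistics, not its own
X_test = np.array([[-1., 1., 0.]])
print(scaler.transform(X_test))   # [[-2.44948974  1.22474487 -0.26726124]]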
Normalization (Min-Max scaling)
Normalization scales the features of a dataset to a specific range, typically between 0 and 1. This is achieved by subtracting the minimum value of a feature from each of its values and then dividing by the range (the maximum minus the minimum).
The formula for normalization is:

X_new = (X - X_min) / (X_max - X_min)
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

# Fit and transform in one step; each column is mapped onto [0, 1]
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(X_train_minmax)
# [[0.5        0.         1.        ]
#  [1.         0.5        0.33333333]
#  [0.         1.         0.        ]]
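The earlier claim about outliers can be checked directly. In this sketch, the value 100.0 is an artificial outlier added for illustration; min-max scaling squeezes the ordinary values into a small sliver of [0, 1], while standardization keeps them more spread out (though the outlier still shifts the mean):

outlier_data = np.array([[1.], [2.], [3.], [100.]])   # 100.0 is an artificial outlier

print(preprocessing.MinMaxScaler().fit_transform(outlier_data).ravel())
# [0.         0.01010101 0.02020202 1.        ]

print(preprocessing.StandardScaler().fit_transform(outlier_data).ravel())
# approximately [-0.6008 -0.5773 -0.5537  1.7318]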
Implementations from Scratch
First, we will import the necessary libraries, load the dataset, and use two features from the Iris dataset (petal length and petal width) for the demonstration.
pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2 matplotlib==3.7.4
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
X = data.iloc[:, 2:]   # petal length (cm) and petal width (cm)
Standardization rescales the data so that the mean is zero and the variance is one. The following code demonstrates how to standardize the dataset.
def standardize(X):
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)
X_std = standardize(X)
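As a quick sanity check, the standardized columns should now have a mean of (approximately) zero and a standard deviation of one:

# Verify the transform; tiny deviations from 0 are floating-point error
print(X_std.to_numpy().mean(axis=0))   # approximately [0. 0.]
print(X_std.to_numpy().std(axis=0))    # [1. 1.]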
Normalization is a 0-1 scaling method: the minimum value of each feature becomes 0 and the maximum becomes 1. The following code shows how to normalize the dataset.
def normalize(X):
    return (X - np.min(X, axis=0)) / (np.max(X, axis=0) - np.min(X, axis=0))
X_norm = normalize(X)
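Similarly, the normalized columns should now span exactly [0, 1]:

# Verify the transform: per-column minima and maxima
print(X_norm.to_numpy().min(axis=0))   # [0. 0.]
print(X_norm.to_numpy().max(axis=0))   # [1. 1.]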
The preprocessing results can be visualized with the following plotting code. The first plot shows the original dataset, the second shows the standardized dataset, and the third shows the normalized dataset.
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(16, 12))

# Original data
ax = fig.add_subplot(2, 2, 1)
ax.scatter(X.iloc[:, 0], X.iloc[:, 1])
ax.set_title("Before Standardization")
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")

# Standardized data
ax = fig.add_subplot(2, 2, 3)
ax.scatter(X_std.iloc[:, 0], X_std.iloc[:, 1])
ax.set_title("After Standardization")
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")

# Normalized data
ax = fig.add_subplot(2, 2, 4)
ax.scatter(X_norm.iloc[:, 0], X_norm.iloc[:, 1])
ax.set_title("After Normalization")
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")

plt.show()