Scaling and Normalizing Data for Machine Learning Models 🐍🤖

In machine learning, scaling and normalizing your data are crucial preprocessing steps before feeding it into a model. Proper scaling ensures that no feature dominates the result simply because of its range, while normalization often improves an algorithm's performance. In this post, we'll explore these concepts in detail, focusing on the methods provided by the scikit-learn library, with code snippets and formulas for clarity.


Why Scale and Normalize ❓

  1. Improves Model Performance: Many machine learning algorithms perform better when features are on a similar scale; distance-based algorithms such as SVM and KNN are especially sensitive to feature scales (see the sketch after this list).
  2. Faster Convergence: Gradient descent converges faster when features are scaled.
  3. Reduces Bias: Without scaling, features with larger ranges can dominate and bias the model toward them.
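
To see point 1 in practice, here is a minimal sketch comparing a KNN classifier with and without standardization. It uses the wine dataset bundled with scikit-learn (my choice for illustration, since its features span very different ranges); exact scores will vary, but the scaled version typically comes out noticeably higher.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# The wine dataset's features range from below 1 up to the hundreds
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# KNN on raw features: distances are dominated by large-range features
knn_raw = KNeighborsClassifier().fit(X_train, y_train)
print("Accuracy without scaling:", knn_raw.score(X_test, y_test))

# Same model after standardizing (scaler fitted on the training set only)
scaler = StandardScaler().fit(X_train)
knn_scaled = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)
print("Accuracy with scaling:", knn_scaled.score(scaler.transform(X_test), y_test))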

Scaling Techniques

Standardization (Z-score Normalization)

Standardization scales the data to have a mean of zero and a standard deviation of one.

The formula is: z = (x - μ) / σ

Where:

  • x is the original value
  • μ is the mean of the feature
  • σ is the standard deviation of the feature

Code Example

from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Standardizing the data
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)

print("Standardized Data:\n", standardized_data)

Output:

Standardized Data:
 [[-1.34164079 -1.34164079]
 [-0.4472136  -0.4472136 ]
 [ 0.4472136   0.4472136 ]
 [ 1.34164079  1.34164079]]
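
To connect this output back to the formula, here is a quick sanity check that recomputes z by hand with NumPy (note that StandardScaler uses the population standard deviation, which is NumPy's default):

import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# z = (x - mu) / sigma, computed column-wise
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print(manual)  # matches the StandardScaler output above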

Min-Max Scaling (Normalization)

Min-Max scaling scales the data to a fixed range, usually [0, 1].

The formula is: x' = (x - x_min) / (x_max - x_min)

Where:

  • x is the original value
  • x_min is the minimum value of the feature
  • x_max is the maximum value of the feature

Code Example


from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Normalizing the data
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)

print("Normalized Data:\n", normalized_data)

Output:

Normalized Data:
 [[0.         0.        ]
 [0.33333333 0.33333333]
 [0.66666667 0.66666667]
 [1.         1.        ]]
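
The same sanity check works here by applying the Min-Max formula column-wise:

import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# x' = (x - x_min) / (x_max - x_min), computed column-wise
col_min, col_max = data.min(axis=0), data.max(axis=0)
manual = (data - col_min) / (col_max - col_min)
print(manual)  # matches the MinMaxScaler output above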

Normalization Techniques

L2 Normalization

L2 normalization rescales each sample (row) so that its Euclidean norm (L2 norm) equals 1.

The formula is: x' = x / ||x||_2

Where ||x||_2 is the L2 norm of the sample vector.

Code Example


from sklearn.preprocessing import Normalizer
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Normalizing the data using L2 norm
normalizer = Normalizer(norm='l2')
l2_normalized_data = normalizer.fit_transform(data)

print("L2 Normalized Data:\n", l2_normalized_data)

Output:

L2 Normalized Data:
 [[0.4472136  0.89442719]
 [0.6        0.8       ]
 [0.6401844  0.76822128]
 [0.65850461 0.75257669]]
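
To confirm, each row now has unit Euclidean length (for the first row, sqrt(0.4472² + 0.8944²) ≈ 1). A self-contained check:

import numpy as np
from sklearn.preprocessing import Normalizer

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
l2_normalized_data = Normalizer(norm='l2').fit_transform(data)

# Each row's L2 norm should be 1
print(np.linalg.norm(l2_normalized_data, axis=1))  # [1. 1. 1. 1.]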

L1 Normalization

L1 normalization rescales each sample (row) so that its Manhattan norm (L1 norm) equals 1.

The formula is: x' = x / ||x||_1

Where ||x||_1 is the L1 norm of the sample vector.

Code Example


from sklearn.preprocessing import Normalizer
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Normalizing the data using L1 norm
normalizer = Normalizer(norm='l1')
l1_normalized_data = normalizer.fit_transform(data)

print("L1 Normalized Data:\n", l1_normalized_data)

Output:

L1 Normalized Data:
 [[0.33333333 0.66666667]
 [0.42857143 0.57142857]
 [0.45454545 0.54545455]
 [0.46666667 0.53333333]]
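
Likewise, after L1 normalization the absolute values in each row sum to 1 (e.g., 0.3333 + 0.6667 = 1 for the first row):

import numpy as np
from sklearn.preprocessing import Normalizer

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
l1_normalized_data = Normalizer(norm='l1').fit_transform(data)

# Each row's absolute values should sum to 1
print(np.abs(l1_normalized_data).sum(axis=1))  # [1. 1. 1. 1.]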

End-to-End Example

Finally, here's how Min-Max scaling fits into a simple classification workflow on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the iris dataset
data = load_iris()
X, y = data.data, data.target

# Normalize the features
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.3, random_state=42)

# Fit a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model
score = model.score(X_test, y_test)
print(f"Model Accuracy: {score:.2f}")

Output: Model Accuracy: 0.91
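
One caveat: the example above fits the scaler on the full dataset before splitting, which leaks information about the test set into preprocessing. A cleaner pattern, sketched below using scikit-learn's Pipeline (the structure is standard, though the exact accuracy you get may differ slightly), is to split first and let the pipeline fit the scaler on the training data only:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

# Split first, so the scaler never sees the test set during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),   # fit_transform on train, transform on test
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
print(f"Model Accuracy: {pipeline.score(X_test, y_test):.2f}")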

Conclusion

Scaling and normalizing your data are fundamental steps in preparing it for machine learning models, and scikit-learn provides convenient, efficient tools for both. Here's a quick summary of the methods discussed:

  • Standardization: Adjusts the data to have a mean of 0 and a standard deviation of 1.
  • Min-Max Scaling: Scales the data to a fixed range, usually [0, 1].
  • L2 Normalization: Scales the data so that the L2 norm of each row is 1.
  • L1 Normalization: Scales the data so that the L1 norm of each row is 1.

By correctly applying these techniques, you can improve both the performance and the convergence of your machine learning models.


About Me:
🖇️LinkedIn
🧑‍💻GitHub
