Scaling and Normalizing Data for Machine Learning Models 🐍🤖

In machine learning, scaling and normalizing your data are crucial preprocessing steps before feeding it into a model. Proper scaling ensures that no feature dominates the result simply because of its range, while normalization often improves an algorithm's performance. In this post, we'll explore these concepts in detail, focusing on the methods provided by the scikit-learn library, with code snippets and formulas for clarity.


Why Scale and Normalize ❓

  1. Improves Model Performance: Many machine learning algorithms perform better when features are on a similar scale; distance-based algorithms such as SVM and KNN are especially sensitive to feature scales (see the sketch after this list).
  2. Faster Convergence: Gradient descent converges faster when features are scaled.
  3. Reduces Bias: Without scaling, features with larger ranges can dominate and bias the model toward them.
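
To see point 1 in practice, here is a minimal sketch comparing a KNN classifier with and without standardization. It uses the wine dataset bundled with scikit-learn (my choice for illustration, since its features span very different ranges); exact scores will vary, but the scaled version typically comes out noticeably higher.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# The wine dataset's features range from below 1 up to the hundreds
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# KNN on raw features: distances are dominated by large-range features
knn_raw = KNeighborsClassifier().fit(X_train, y_train)
print("Accuracy without scaling:", knn_raw.score(X_test, y_test))

# Same model after standardizing (scaler fitted on the training set only)
scaler = StandardScaler().fit(X_train)
knn_scaled = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)
print("Accuracy with scaling:", knn_scaled.score(scaler.transform(X_test), y_test))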

Scaling Techniques

Standardization (Z-score Normalization)

Standardization scales the data to have a mean of zero and a standard deviation of one.

The formula is: z = (x - μ) / σ

Where:

  • x is the original value
  • μ is the mean of the feature
  • σ is the standard deviation of the feature

Code Example

from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Standardizing the data
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)

print("Standardized Data:\n", standardized_data)

Output:

Standardized Data:
 [[-1.34164079 -1.34164079]
 [-0.4472136  -0.4472136 ]
 [ 0.4472136   0.4472136 ]
 [ 1.34164079  1.34164079]]
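
To connect this output back to the formula, here is a quick sanity check that recomputes z by hand with NumPy (note that StandardScaler uses the population standard deviation, which is NumPy's default):

import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# z = (x - mu) / sigma, computed column-wise
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print(manual)  # matches the StandardScaler output above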

Min-Max Scaling (Normalization)

Min-Max scaling scales the data to a fixed range, usually [0, 1].

The formula is: x' = (x - x_min) / (x_max - x_min)

Where:

  • x is the original value
  • x_min is the minimum value of the feature
  • x_max is the maximum value of the feature

Code Example


from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Normalizing the data
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)

print("Normalized Data:\n", normalized_data)

Output:

Normalized Data:
 [[0.         0.        ]
 [0.33333333 0.33333333]
 [0.66666667 0.66666667]
 [1.         1.        ]]
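
The same sanity check works here by applying the Min-Max formula column-wise:

import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# x' = (x - x_min) / (x_max - x_min), computed column-wise
col_min, col_max = data.min(axis=0), data.max(axis=0)
manual = (data - col_min) / (col_max - col_min)
print(manual)  # matches the MinMaxScaler output above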

Normalization Techniques

L2 Normalization

L2 normalization rescales each sample (row) so that its Euclidean norm (L2 norm) equals 1.

The formula is: x' = x / ||x||_2

Where ||x||_2 is the L2 norm of the sample vector.

Code Example


from sklearn.preprocessing import Normalizer
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Normalizing the data using L2 norm
normalizer = Normalizer(norm='l2')
l2_normalized_data = normalizer.fit_transform(data)

print("L2 Normalized Data:\n", l2_normalized_data)

Output:

L2 Normalized Data:
 [[0.4472136  0.89442719]
 [0.6        0.8       ]
 [0.6401844  0.76822128]
 [0.65850461 0.75257669]]
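
To confirm, each row now has unit Euclidean length (for the first row, sqrt(0.4472² + 0.8944²) ≈ 1). A self-contained check:

import numpy as np
from sklearn.preprocessing import Normalizer

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
l2_normalized_data = Normalizer(norm='l2').fit_transform(data)

# Each row's L2 norm should be 1
print(np.linalg.norm(l2_normalized_data, axis=1))  # [1. 1. 1. 1.]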

L1 Normalization

L1 normalization rescales each sample (row) so that its Manhattan norm (L1 norm) equals 1.

The formula is: x' = x / ||x||_1

Where ||x||_1 is the L1 norm of the sample vector.

Code Example


from sklearn.preprocessing import Normalizer
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Normalizing the data using L1 norm
normalizer = Normalizer(norm='l1')
l1_normalized_data = normalizer.fit_transform(data)

print("L1 Normalized Data:\n", l1_normalized_data)

Output:

L1 Normalized Data:
 [[0.33333333 0.66666667]
 [0.42857143 0.57142857]
 [0.45454545 0.54545455]
 [0.46666667 0.53333333]]
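
Likewise, after L1 normalization the absolute values in each row sum to 1 (e.g., 0.3333 + 0.6667 = 1 for the first row):

import numpy as np
from sklearn.preprocessing import Normalizer

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
l1_normalized_data = Normalizer(norm='l1').fit_transform(data)

# Each row's absolute values should sum to 1
print(np.abs(l1_normalized_data).sum(axis=1))  # [1. 1. 1. 1.]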

End-to-End Example

Finally, here's how Min-Max scaling fits into a simple classification workflow on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the iris dataset
data = load_iris()
X, y = data.data, data.target

# Normalize the features
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.3, random_state=42)

# Fit a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model
score = model.score(X_test, y_test)
print(f"Model Accuracy: {score:.2f}")

Output: Model Accuracy: 0.91
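
One caveat: the example above fits the scaler on the full dataset before splitting, which leaks information about the test set into preprocessing. A cleaner pattern, sketched below using scikit-learn's Pipeline (the structure is standard, though the exact accuracy you get may differ slightly), is to split first and let the pipeline fit the scaler on the training data only:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

# Split first, so the scaler never sees the test set during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),   # fit_transform on train, transform on test
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
print(f"Model Accuracy: {pipeline.score(X_test, y_test):.2f}")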

Conclusion

Scaling and normalizing your data are fundamental steps in preparing it for machine learning models, and scikit-learn provides convenient, efficient tools for both. Here's a quick summary of the methods discussed:

  • Standardization: Adjusts the data to have a mean of 0 and a standard deviation of 1.
  • Min-Max Scaling: Scales the data to a fixed range, usually [0, 1].
  • L2 Normalization: Scales the data so that the L2 norm of each row is 1.
  • L1 Normalization: Scales the data so that the L1 norm of each row is 1.

By correctly applying these techniques, you can improve both the performance and the convergence of your machine learning models.


About Me:
🖇️LinkedIn
🧑‍💻GitHub
