<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: sagar borse</title>
    <description>The latest articles on DEV Community by sagar borse (@sagar_borse_a8da7af37d6bc).</description>
    <link>https://dev.to/sagar_borse_a8da7af37d6bc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2157939%2Fab4af402-92f6-404d-be0f-9e7bc6e10c24.png</url>
      <title>DEV Community: sagar borse</title>
      <link>https://dev.to/sagar_borse_a8da7af37d6bc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sagar_borse_a8da7af37d6bc"/>
    <language>en</language>
    <item>
      <title>Feature Scaling in Machine Learning</title>
      <dc:creator>sagar borse</dc:creator>
      <pubDate>Fri, 04 Oct 2024 17:57:45 +0000</pubDate>
      <link>https://dev.to/sagar_borse_a8da7af37d6bc/feature-scaling-in-machine-learning-4j1f</link>
      <guid>https://dev.to/sagar_borse_a8da7af37d6bc/feature-scaling-in-machine-learning-4j1f</guid>
      <description>&lt;p&gt;Feature scaling is a technique used in data preprocessing to ensure that all the features (independent variables) in a dataset have a similar scale. This is important because many machine learning algorithms perform better prediction when the features are on a similar scale.&lt;/p&gt;

&lt;p&gt;Example: suppose we have a dataset with two features: height (in centimeters) and weight (in kilograms). If we don’t scale these features, the algorithm might give more importance to the feature with larger values (height) just because of its scale, not because it’s more important.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Why use Feature Scaling?&lt;/strong&gt;&lt;/em&gt; &lt;br&gt;
Feature scaling is important in machine learning for several key reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Improves Algorithm Performance: Many machine learning algorithms, like those using gradient descent, work better and converge faster when the features are on a similar scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fair Contribution: Without scaling, features with larger ranges can dominate the learning process. Scaling ensures that all features contribute equally to the model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Accurate Distance Calculations: Algorithms that rely on distance calculations, such as k-nearest neighbors (KNN) and support vector machines (SVM), need scaled features to compute accurate distances.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistent Units: When features are measured in different units (e.g., height in centimeters and weight in kilograms), scaling helps standardize them, making the model’s calculations more consistent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handling Outliers: Some scaling methods, like robust scaling, help reduce the impact of outliers by focusing on the median and interquartile range.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
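&lt;p&gt;The robust scaling mentioned in point 5 can be sketched in plain Python; the helper below is illustrative (it is not a specific library API) and uses the median and interquartile range:&lt;/p&gt;

```python
from statistics import median

def robust_scale(values):
    # Center on the median and divide by the interquartile range (IQR),
    # so a single extreme value has little effect on the scale.
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    med = median(sorted_vals)
    q1 = median(sorted_vals[: n // 2])        # median of the lower half
    q3 = median(sorted_vals[(n + 1) // 2 :])  # median of the upper half
    iqr = q3 - q1
    return [(x - med) / iqr for x in values]

# The outlier (1000) barely changes how the other values are scaled
print(robust_scale([10, 20, 30, 40, 50, 60, 70, 1000]))
```

&lt;p&gt;Because the median (45) and IQR (40) come from the typical values, the outlier lands far outside the usual range instead of squashing every other value toward zero.&lt;/p&gt;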

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Algorithms That Need Feature Scaling:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Gradient Descent-Based Algorithms:&lt;/strong&gt; These include linear regression, logistic regression, and neural networks. Scaling helps these algorithms converge faster.&lt;br&gt;
&lt;strong&gt;2. Distance-Based Algorithms:&lt;/strong&gt; Algorithms like k-nearest neighbors (KNN) and support vector machines (SVM) rely on distance calculations, which are affected by the scale of the features.&lt;br&gt;
&lt;strong&gt;3. Principal Component Analysis (PCA):&lt;/strong&gt; PCA is sensitive to the variances of the features, so scaling ensures that each feature contributes equally.&lt;br&gt;
&lt;strong&gt;4. Clustering Algorithms:&lt;/strong&gt; Algorithms like k-means clustering use distance measures, so scaling is important to ensure fair clustering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Algorithms That Do Not Need Feature Scaling:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Tree-Based Algorithms: Decision trees, random forests, and gradient boosting algorithms (like XGBoost) are generally not affected by the scale of the features. These algorithms split the data based on feature values, so scaling is not necessary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Naive Bayes: This algorithm is based on probability and is not influenced by the scale of the features.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Types of feature scaling:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Standardization (also called Z-score Normalization):&lt;/strong&gt; Standardization is a technique used to transform data so that it has a mean of 0 and a standard deviation of 1. This process is particularly useful when the data follows a Gaussian (normal) distribution.&lt;br&gt;
First, calculate the mean and standard deviation of the feature we want to standardize.&lt;/p&gt;

&lt;p&gt;Then subtract the mean from each entry and divide the result by the standard deviation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt; Z = (X - mu) / sigma&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;(X): the original feature value.&lt;/p&gt;

&lt;p&gt;(mu): the mean (average) of the feature.&lt;/p&gt;

&lt;p&gt;(sigma): the standard deviation of the feature.&lt;/p&gt;

&lt;p&gt;(Z): the standardized value.&lt;/p&gt;

&lt;p&gt;Example with Age and Salary Features&lt;/p&gt;

&lt;p&gt;Age=[25,35,45,55,65]&lt;/p&gt;

&lt;p&gt;Salary = [50000,60000,70000,80000,90000]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calculate the Mean (mu):&lt;/strong&gt;&lt;br&gt;
For Age: mean = (25 + 35 + 45 + 55 + 65) / 5 = 45&lt;/p&gt;

&lt;p&gt;For Salary: mean = (50000 + 60000 + 70000 + 80000 + 90000) / 5 = 70000&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calculate the Standard Deviation (sigma):&lt;/strong&gt;&lt;br&gt;
For Age: sigma = sqrt(((25-45)^2 + (35-45)^2 + (45-45)^2 + (55-45)^2 + (65-45)^2) / 5) = 14.14&lt;/p&gt;

&lt;p&gt;For Salary: sigma = sqrt(((50000-70000)^2 + (60000-70000)^2 + (70000-70000)^2 + (80000-70000)^2 + (90000-70000)^2) / 5) = 14142.14&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apply the Formula:&lt;/strong&gt;&lt;br&gt;
For Age:&lt;/p&gt;

&lt;p&gt;X1 = (25 - 45) / 14.14 = -1.41&lt;/p&gt;

&lt;p&gt;X1 = (35 - 45) / 14.14 = -0.71&lt;/p&gt;

&lt;p&gt;X1 = (45 - 45) / 14.14 = 0&lt;/p&gt;

&lt;p&gt;X1 = (55 - 45) / 14.14 = 0.71&lt;/p&gt;

&lt;p&gt;X1 = (65 - 45) / 14.14 = 1.41&lt;/p&gt;

&lt;p&gt;For Salary:&lt;/p&gt;

&lt;p&gt;X2 = (50000 - 70000) / 14142.14 = -1.41&lt;/p&gt;

&lt;p&gt;X2 = (60000 - 70000) / 14142.14 = -0.71&lt;/p&gt;

&lt;p&gt;X2 = (70000 - 70000) / 14142.14 = 0&lt;/p&gt;

&lt;p&gt;X2 = (80000 - 70000) / 14142.14 = 0.71&lt;/p&gt;

&lt;p&gt;X2 = (90000 - 70000) / 14142.14 = 1.41&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standardized Data:&lt;/strong&gt; The standardized values for Age and Salary are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Age (Standardized)&lt;/strong&gt; = [-1.41, -0.71, 0, 0.71, 1.41]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Salary (Standardized)&lt;/strong&gt; = [-1.41, -0.71, 0, 0.71, 1.41]&lt;/p&gt;
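&lt;p&gt;The worked example above can be reproduced in a few lines of Python (standard library only; values rounded to two decimals, as in the hand calculation):&lt;/p&gt;

```python
from statistics import mean, pstdev

age = [25, 35, 45, 55, 65]
salary = [50000, 60000, 70000, 80000, 90000]

def standardize(values):
    # Z = (X - mu) / sigma, with sigma the population standard deviation
    mu = mean(values)
    sigma = pstdev(values)
    return [round((x - mu) / sigma, 2) for x in values]

print(standardize(age))     # [-1.41, -0.71, 0.0, 0.71, 1.41]
print(standardize(salary))  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```

&lt;p&gt;Both features end up on the same scale even though their raw ranges differ by three orders of magnitude; scikit-learn’s StandardScaler performs this same computation.&lt;/p&gt;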

&lt;p&gt;&lt;strong&gt;2. Normalization (Min-Max Scaling)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Normalization is a technique used to adjust the range of feature values in our data so they all fit within a specific scale, usually 0 to 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Why Use Normalization?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Equal Contribution: Ensures each feature contributes equally to the analysis, preventing larger-scale features from dominating the results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improves Performance: Helps algorithms like KNN, SVM, and Neural Networks perform better and more efficiently by removing bias due to varying feature scales.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Smooth Convergence: Makes training algorithms converge faster and more reliably.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;How It Works:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Identify Min and Max Values: Find the minimum and maximum values for each feature.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apply the Formula: Rescale each feature using the formula:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Normalized Value = (Original Value - Min) / (Max - Min)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine we have heights (cm) and weights (kg) of people:&lt;/p&gt;

&lt;p&gt;Height: Min = 150, Max = 200&lt;/p&gt;

&lt;p&gt;Weight: Min = 50, Max = 100&lt;/p&gt;

&lt;p&gt;If a person is 180 cm tall and weighs 75 kg:&lt;/p&gt;

&lt;p&gt;Normalized Height: (180 - 150) / (200 - 150) = 0.6&lt;/p&gt;

&lt;p&gt;Normalized Weight: (75 - 50) / (100 - 50) = 0.5&lt;/p&gt;

&lt;p&gt;This rescaling makes it easier for algorithms to process the data without any one feature overshadowing the others.&lt;/p&gt;
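&lt;p&gt;The min-max calculation above can be sketched directly from the formula:&lt;/p&gt;

```python
def min_max_scale(value, min_value, max_value):
    # Normalized Value = (Original Value - Min) / (Max - Min)
    return (value - min_value) / (max_value - min_value)

print(min_max_scale(180, 150, 200))  # 0.6 (height of 180 cm)
print(min_max_scale(75, 50, 100))    # 0.5 (weight of 75 kg)
```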

&lt;p&gt;Source code example link : &lt;a href="https://www.kaggle.com/code/sagarborse/notebookb6572850b7" rel="noopener noreferrer"&gt;https://www.kaggle.com/code/sagarborse/notebookb6572850b7&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Feature Engineering In Machine Learning</title>
      <dc:creator>sagar borse</dc:creator>
      <pubDate>Wed, 02 Oct 2024 15:55:12 +0000</pubDate>
      <link>https://dev.to/sagar_borse_a8da7af37d6bc/feature-engineering-21oo</link>
      <guid>https://dev.to/sagar_borse_a8da7af37d6bc/feature-engineering-21oo</guid>
      <description>&lt;p&gt;▶ What is Feature Engineering?&lt;br&gt;
Feature engineering is the process of transforming raw data into useful features that can be used to improve the performance of machine learning models. Here, a feature is a column in the dataset, such as Name, Age, or Height.&lt;br&gt;
Imagine we have a dataset with lots of information, like a spreadsheet with columns for age, height, weight, etc. Feature engineering involves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Selecting the most important columns (features).&lt;/li&gt;
&lt;li&gt;Creating new columns from the existing ones (like calculating BMI from height and weight).&lt;/li&gt;
&lt;li&gt;Transforming data to make it more useful (like converting text to numbers).&lt;/li&gt;
&lt;/ol&gt;
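&lt;p&gt;The three steps above can be sketched on a tiny made-up dataset (the column names and category codes are illustrative):&lt;/p&gt;

```python
people = [
    {"age": 30, "height_cm": 170, "weight_kg": 65, "gender": "male"},
    {"age": 25, "height_cm": 160, "weight_kg": 55, "gender": "female"},
]

gender_codes = {"male": 0, "female": 1}  # step 3: convert text to numbers

rows = []
for p in people:
    height_m = p["height_cm"] / 100
    rows.append({
        "age": p["age"],                                  # step 1: keep a useful column
        "bmi": round(p["weight_kg"] / height_m ** 2, 1),  # step 2: new column (BMI)
        "gender": gender_codes[p["gender"]],              # step 3: encoded category
    })

print(rows)  # [{'age': 30, 'bmi': 22.5, 'gender': 0}, {'age': 25, 'bmi': 21.5, 'gender': 1}]
```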

&lt;p&gt;▶ Why It’s Important:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Improves Model Accuracy: Better features help the model make more accurate predictions.&lt;/li&gt;
&lt;li&gt;Simplifies Data: Makes complex data easier to understand and use.&lt;/li&gt;
&lt;li&gt;Highlights Key Information: Brings out the most important aspects of the data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example: If we have data about houses, we might create a new feature called “price per square foot” by dividing the price by the size of the house. This new feature can help the model understand the value of the house better.&lt;/p&gt;

&lt;p&gt;▶ Types of feature engineering&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Feature Transformation: Feature Transformation is the process of transforming the features into a more suitable representation for the machine learning model. This is done to ensure that the model can effectively learn from the data.&lt;/li&gt;
&lt;li&gt;Feature Creation : Feature Creation is the process of generating new features based on domain knowledge or by observing patterns in the data. It is a form of feature engineering that can significantly improve the performance of a machine-learning model.&lt;/li&gt;
&lt;li&gt;Feature Selection: Feature Selection is the process of selecting a subset of relevant features from the dataset to be used in a machine-learning model. It is an important step in the feature engineering process as it can have a significant impact on the model’s performance.&lt;/li&gt;
&lt;li&gt;Feature Extraction: Feature Extraction is the process of creating new features from existing ones to provide more relevant information to the machine learning model. This is done by transforming, combining, or aggregating existing features.&lt;/li&gt;
&lt;li&gt;Feature Scaling: Feature Scaling is the process of transforming the features so that they have a similar scale. This is important in machine learning because the scale of the features can affect the performance of the model.&lt;/li&gt;
&lt;/ol&gt;
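&lt;p&gt;As a small sketch of feature creation (type 2 above), the “price per square foot” idea can be written as follows; the field names are illustrative:&lt;/p&gt;

```python
houses = [
    {"price": 300000, "sqft": 1500},
    {"price": 450000, "sqft": 2000},
]

for house in houses:
    # New feature derived from two existing ones
    house["price_per_sqft"] = house["price"] / house["sqft"]

print([h["price_per_sqft"] for h in houses])  # [200.0, 225.0]
```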

&lt;p&gt;#MachineLearning #AI #ML #FeatureEngineering #100DaysOfMachineLearning #MLGuru&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
