Ensuring consistency in the numerical input data is crucial to enhancing the performance of machine learning algorithms. To achieve this uniformity, it is necessary to adjust the data to a standardized range.

Standardization and Normalization are both widely used techniques for adjusting data before feeding it into machine learning models.

In this article, you will learn how to use the `StandardScaler` class to scale the input data.

## What is Standardization?

Before diving into the fundamentals of the StandardScaler class, you need to understand the standardization of the data.

**Standardization** is a data preparation method that involves adjusting the input (features) by first centering them (subtracting the mean from each data point) and then dividing them by the standard deviation, resulting in the data having a **mean of 0** and a **standard deviation of 1**.

The formula for standardization can be written as follows:

**standardized_val = ( input_value - mean ) / standard_deviation**

Assume you have a mean value of **10.4** and a standard deviation value of **4**. To standardize the value of **15.9**, put the given values into the equation as follows:

**standardized_val = ( 15.9 - 10.4 ) / 4**

**standardized_val = ( 5.5 ) / 4**

**standardized_val = 1.375**
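The arithmetic above can be checked with a few lines of plain Python. This is only a minimal sketch: the helper name `standardize` is made up for illustration and is not part of any library.

```python
def standardize(value, mean, std):
    """Return the standardized value (z-score) of a single number.

    Hypothetical helper for illustration only.
    """
    return (value - mean) / std


# Values taken from the worked example: mean 10.4, standard deviation 4
z = standardize(15.9, 10.4, 4)
print(round(z, 3))  # 1.375
```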

The `StandardScaler` class stands out as a widely used tool for implementing data standardization.

## What is StandardScaler?

The `StandardScaler` class provided by scikit-learn applies standardization to the input features, so that each feature has a **mean of approximately 0** and a **standard deviation of approximately 1**.

It adjusts the data to have a standardized distribution, making it suitable for modeling and ensuring that no single feature disproportionately influences the algorithm due to differences in scale.

## Why Bother Using it?

By now you have seen the idea behind `StandardScaler`, but to highlight, here are the primary reasons to use it:

- It can improve the performance of machine learning models.
- It keeps the features on a consistent scale.
- It is useful when working with machine learning algorithms that can be negatively influenced by differences in the scale of the features.

## How to Use StandardScaler?

First, import the `StandardScaler` class from the `sklearn.preprocessing` module. After that, create an instance of the class with `StandardScaler()`. Finally, call the instance's `fit_transform` method on the input data.

```
# Import the required libraries
import numpy as np
from sklearn.preprocessing import StandardScaler

# Create a 2D array (rows are samples, columns are features)
arr = np.asarray([[12, 0.007],
                  [45, 1.5],
                  [75, 2.005],
                  [7, 0.8],
                  [15, 0.045]])
print("Original Array: \n", arr)

# Instance of the StandardScaler class
scaler = StandardScaler()

# Fit the scaler to the data, then transform the data
arr_scaled = scaler.fit_transform(arr)
print("Scaled Array: \n", arr_scaled)
```

An instance of the `StandardScaler` class is created and stored in the variable `scaler`. This instance will be used to standardize the data.

The `fit_transform` method of the `StandardScaler` object (`scaler`) is called with the original data `arr` as the input.

The `fit_transform` method computes the mean and standard deviation of each feature (column) in the input data `arr` and then uses them to standardize every value in that column.
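To make the two steps inside `fit_transform` more concrete, here is a small sketch that calls `fit` and `transform` separately on the same array and compares the result with the formula applied by hand using NumPy. The variable name `manual` is only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

arr = np.asarray([[12, 0.007],
                  [45, 1.5],
                  [75, 2.005],
                  [7, 0.8],
                  [15, 0.045]])

scaler = StandardScaler()

# Step 1: fit learns the mean and standard deviation of each column
scaler.fit(arr)
print("Column means:", scaler.mean_)
print("Column standard deviations:", scaler.scale_)

# Step 2: transform applies (value - mean) / standard_deviation per column
arr_scaled = scaler.transform(arr)

# The same calculation by hand (np.std uses the population formula by default)
manual = (arr - arr.mean(axis=0)) / arr.std(axis=0)
print("Matches manual calculation:", np.allclose(arr_scaled, manual))
```

Keeping `fit` and `transform` separate is handy when the statistics should be learned from one set of data (for example, a training set) and then reused on another.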

Here's the original array and the standardized version of the original array.

```
Original Array:
[[1.200e+01 7.000e-03]
[4.500e+01 1.500e+00]
[7.500e+01 2.005e+00]
[7.000e+00 8.000e-01]
[1.500e+01 4.500e-02]]
Scaled Array:
[[-0.72905466 -1.09507083]
[ 0.55066894 0.79634605]
[ 1.71405403 1.43610862]
[-0.92295217 -0.09045356]
[-0.61271615 -1.04693028]]
```
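As a quick sanity check, the sketch below confirms that each column of the scaled array has a mean close to 0 and a standard deviation close to 1, and that `inverse_transform` recovers the original values. The names `col_means`, `col_stds`, and `restored` are just illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

arr = np.asarray([[12, 0.007],
                  [45, 1.5],
                  [75, 2.005],
                  [7, 0.8],
                  [15, 0.045]])

scaler = StandardScaler()
arr_scaled = scaler.fit_transform(arr)

# Per-column statistics of the scaled data
col_means = arr_scaled.mean(axis=0)
col_stds = arr_scaled.std(axis=0)
print("Column means close to 0:", np.allclose(col_means, 0))
print("Column standard deviations close to 1:", np.allclose(col_stds, 1))

# inverse_transform undoes the scaling
restored = scaler.inverse_transform(arr_scaled)
print("Round trip matches original:", np.allclose(restored, arr))
```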

## Does Standardization Affect the Accuracy of the Model?

In this section, you'll see how the model's performance is affected after applying standardization to features of the dataset.

Let's see how the model will perform on the raw dataset without standardizing the feature variables.

```
# Evaluate KNN on the breast cancer dataset
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from numpy import mean
# load dataset
df = datasets.load_breast_cancer()
X = df.data
y = df.target
# Instantiating the model
model = KNeighborsClassifier()
# Evaluating the model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10, n_jobs=-1)
# Model's average score
print(f'Accuracy: {mean(scores):.2f}')
```

The breast cancer dataset is loaded from `sklearn.datasets`, and then the features (`df.data`) and target (`df.target`) are stored in the `X` and `y` variables.

The K-nearest neighbors (KNN) classifier is instantiated using the `KNeighborsClassifier` class and stored in the `model` variable.

The `cross_val_score` function is used to evaluate the KNN model's performance. It receives the model (`KNeighborsClassifier()`), the features (`X`), and the target (`y`), and specifies that accuracy (`scoring='accuracy'`) should be used as the evaluation metric.

The function splits the dataset into 10 folds (`cv=10`), which means the model is trained and tested 10 times, each time holding out a different fold for testing. Here, `n_jobs=-1` means using all the available CPU cores for faster cross-validation.

Finally, the average of the accuracy scores (`mean(scores)`) is printed.

```
Accuracy: 0.93
```

Without standardizing the dataset's feature variables, the average accuracy score is **93%**.

### Using StandardScaler for Applying Standardization

```
# Evaluate KNN on the breast cancer dataset
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from numpy import mean
# loading dataset and configuring features and target variables
df = datasets.load_breast_cancer()
X = df.data
y = df.target
# Standardizing features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Instantiating model
model = KNeighborsClassifier()
# Evaluating the model
scores = cross_val_score(model, X_scaled, y, scoring='accuracy', cv=10, n_jobs=-1)
# Model's average score
print(f'Accuracy: {mean(scores):.2f}')
```

The dataset's features are scaled with `StandardScaler()`, and the resulting scaled dataset is stored in the `X_scaled` variable.

Next, this scaled dataset is used as input for the `cross_val_score` function to compute and display the accuracy.

```
Accuracy: 0.97
```

The accuracy score has increased to **97%**, compared to the previous accuracy score of **93%**.

Applying `StandardScaler()` to standardize the features has notably improved the KNN model's performance on this dataset.
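One caveat about the code above: the scaler is fitted on the whole dataset before cross-validation, so each test fold contributes to the mean and standard deviation used for scaling. A common alternative, shown here as a sketch and not as the approach used in this article, is to wrap the scaler and the model in a pipeline with scikit-learn's `make_pipeline`, so the scaler is refitted on the training folds only. The variable name `pipeline` is illustrative, and the score may differ slightly from the one above.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load features and target directly as arrays
X, y = datasets.load_breast_cancer(return_X_y=True)

# The scaler is fitted inside each cross-validation split, on training folds only
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=10)
print(f'Accuracy: {scores.mean():.2f}')
```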

## Conclusion

**StandardScaler** is used to standardize the input data in a way that ensures that the data points have a balanced scale, which is crucial for machine learning algorithms, especially those that are sensitive to differences in feature scales.

Standardization transforms the data such that the mean of each feature becomes zero (centered at zero), and the standard deviation becomes one.

Let's recall what you've learned:

- What **StandardScaler** actually is
- What **standardization** is and how it is applied to the data points
- The impact of **StandardScaler** on the **model's performance**

**That's all for now**

**Keep Coding**
