Standardization in Statistics

Overview of Standardization in Statistics

Standardization is a statistical technique used to transform the values in a dataset into a comparable form. Specifically, each value has the mean subtracted and is then divided by the standard deviation, z = (x - mean) / std, so that the transformed data has a mean of 0 and a standard deviation of 1. The result is unitless, which allows data with different scales or units to be compared on the same basis.
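The transformation can be sketched directly in NumPy. This is a minimal illustration, not a library implementation: the function name standardize and the height values are made up for the example, and the population standard deviation (ddof=0) is assumed.

```python
import numpy as np

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0), an assumption here
    return (values - mean) / std

# Made-up heights in centimeters, for illustration only
heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
z_scores = standardize(heights_cm)

print(z_scores)                         # unitless values centered on 0
print(z_scores.mean(), z_scores.std())  # approximately 0 and 1
```

Each z-score reads as "how many standard deviations this value lies above or below the mean".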

Why is Standardization Important?

Easier Comparison:

Standardization makes it easier to compare variables measured on different scales. For example, income (in dollars) and age (in years) cannot be compared directly, but after standardization both are expressed as the number of standard deviations from their own mean, so they can be analyzed on the same basis.
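As a small sketch of the income and age example, the snippet below standardizes both variables for a hypothetical group of five people. All values are invented for illustration.

```python
import numpy as np

# Hypothetical sample of five people (values invented for illustration)
income = np.array([30_000.0, 45_000.0, 52_000.0, 61_000.0, 120_000.0])  # dollars
age = np.array([22.0, 35.0, 41.0, 48.0, 67.0])                          # years

# Standardize each variable using its own mean and standard deviation
income_z = (income - income.mean()) / income.std()
age_z = (age - age.mean()) / age.std()

# Both columns are now unitless and directly comparable
for i, (iz, az) in enumerate(zip(income_z, age_z)):
    print(f"person {i}: income z = {iz:+.2f}, age z = {az:+.2f}")
```

After the transformation, a question such as "is this person more unusual in income or in age?" can be answered by comparing the two z-scores.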

Importance in Machine Learning:

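Many machine learning algorithms are sensitive to the scale of their input features. Distance-based methods such as k-nearest neighbors and support vector machines compute distances between data points, so without standardization a feature with a larger scale can dominate the result. Neural networks are not distance-based, but their gradient-based training also tends to be more stable when the inputs share a similar scale.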
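The effect of scale on distances can be seen in a small sketch. The three data points below are hypothetical, and the helper euclidean is written only for this example.

```python
import numpy as np

# Hypothetical data: columns are [income in dollars, age in years]
X = np.array([
    [50_000.0, 25.0],  # person A
    [50_500.0, 65.0],  # person B: similar income, very different age
    [58_000.0, 26.0],  # person C: different income, similar age
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Distances on the raw data: income dominates because its numbers are larger
d_ab_raw = euclidean(X[0], X[1])
d_ac_raw = euclidean(X[0], X[2])

# Standardize each column, then recompute the distances
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d_ab_std = euclidean(X_std[0], X_std[1])
d_ac_std = euclidean(X_std[0], X_std[2])

print(f"raw:          A-B = {d_ab_raw:.1f}, A-C = {d_ac_raw:.1f}")
print(f"standardized: A-B = {d_ab_std:.2f}, A-C = {d_ac_std:.2f}")
```

On the raw data the 40-year age gap between A and B barely registers next to the income differences. After standardization both features contribute on a comparable footing.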

Aligning Data Distributions:

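Standardization aligns the location and spread of variables that have different distributions, giving every variable a mean of 0 and a standard deviation of 1. This makes it easier for statistical models and machine learning models to learn stably. Note that it does not change the shape of a distribution: skewed data remains skewed after standardization.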
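One caveat is worth illustrating: standardization shifts and rescales values but leaves the shape of the distribution unchanged. The sketch below uses a right-skewed exponential sample and a hand-written skewness helper (both chosen only for this example) to show that skewness is the same before and after.

```python
import numpy as np

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=10_000)  # right-skewed sample

def skewness(x):
    """Sample skewness, computed as the mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

standardized = (skewed - skewed.mean()) / skewed.std()

print(f"mean: {skewed.mean():.2f} -> {standardized.mean():.2f}")
print(f"std:  {skewed.std():.2f} -> {standardized.std():.2f}")
print(f"skew: {skewness(skewed):.2f} -> {skewness(standardized):.2f}")
```

If a model needs approximately symmetric inputs, a separate transformation (for example a log transform) would be required in addition to standardization.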

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Generate data
np.random.seed(42)
data1 = np.random.normal(50, 10, 1000)  # Normal distribution with mean 50 and standard deviation 10
data2 = np.random.normal(30, 5, 1000)   # Normal distribution with mean 30 and standard deviation 5

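# Perform standardization
# StandardScaler expects a 2D array (samples x features), so reshape each dataset to one column.
# Each fit_transform call re-fits the scaler, so each dataset is scaled with its own mean and std.
scaler = StandardScaler()
data1_standardized = scaler.fit_transform(data1.reshape(-1, 1)).flatten()
data2_standardized = scaler.fit_transform(data2.reshape(-1, 1)).flatten()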

# Plot histograms of the original data
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
plt.hist(data1, bins=30, alpha=0.6, color='blue', label='Original Data 1')
plt.hist(data2, bins=30, alpha=0.6, color='green', label='Original Data 2')
plt.title('Original Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()

# Plot histograms of the standardized data
plt.subplot(1, 2, 2)
plt.hist(data1_standardized, bins=30, alpha=0.6, color='blue', label='Standardized Data 1')
plt.hist(data2_standardized, bins=30, alpha=0.6, color='green', label='Standardized Data 2')
plt.title('Standardized Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()

plt.tight_layout()
plt.show()

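Result of Code Execution

[Figure: two side-by-side panels of histograms. Left panel: "Original Data". Right panel: "Standardized Data". Axes: Value (x), Frequency (y).]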

Step-by-Step Code Explanation

Importing Libraries:

numpy and matplotlib.pyplot are the main libraries used for data generation and visualization.
sklearn.preprocessing.StandardScaler is used to perform standardization (converting mean to 0 and standard deviation to 1).

Generating Data:

The random seed is set to ensure reproducibility.
1000 data points are drawn from each of two normal distributions: one with mean 50 and standard deviation 10, the other with mean 30 and standard deviation 5.

Standardizing the Data:

The StandardScaler is used to standardize each dataset, making the mean 0 and the standard deviation 1.
Because StandardScaler expects a 2D array of shape (n_samples, n_features), each dataset is reshaped into a single column with reshape(-1, 1) and then flattened back to a 1D array after standardization.
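As a sketch of what StandardScaler computes, the snippet below places two features side by side as columns of one 2D array and compares the scaler's output with a manual calculation. The data here is generated only for this example. StandardScaler standardizes each column independently and uses the population standard deviation (ddof=0), which is also the default of NumPy's std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Two features as columns of one 2D array, shape (1000, 2)
X = np.column_stack([
    rng.normal(50, 10, 1000),
    rng.normal(30, 5, 1000),
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column is standardized independently

# Manual equivalent: per-column mean and population standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))  # do the two results agree?
print(scaler.mean_)                     # per-column means learned by fit
print(scaler.scale_)                    # per-column standard deviations learned by fit
```

With a multi-column array, a single fit_transform call handles all features at once, so the reshape and flatten steps are only needed for 1D input.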

Plotting the Histograms:

The graph size is set, and two subplots are created.

The left subplot displays the histograms of the original data.
The right subplot displays the histograms of the standardized data.
Transparency, color, labels, etc., are set to clearly display the histograms.

Adjusting and Displaying the Graph Layout:

The tight_layout function automatically adjusts the layout to prevent graphs from overlapping.
Finally, the graph is displayed to check the results.
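The visual check can be complemented with a numerical one. The following sketch regenerates the same data with the same seed and records the mean and standard deviation before and after standardization. The dictionary name summary is an assumption made for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
data1 = np.random.normal(50, 10, 1000)
data2 = np.random.normal(30, 5, 1000)

summary = {}
for name, data in [("data1", data1), ("data2", data2)]:
    # A fresh scaler per dataset, so each one uses its own mean and std
    standardized = StandardScaler().fit_transform(data.reshape(-1, 1)).flatten()
    summary[name] = (data.mean(), data.std(), standardized.mean(), standardized.std())

for name, (m0, s0, m1, s1) in summary.items():
    print(f"{name}: original mean={m0:.2f}, std={s0:.2f} | "
          f"standardized mean={m1:.2f}, std={s1:.2f}")
```

The original sample means and standard deviations will be close to, but not exactly, the distribution parameters, while the standardized values are 0 and 1 up to floating-point error.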

Summary

This code standardizes data sampled from two different normal distributions and visually compares the original and standardized data using histograms. In the original data the two histograms sit at different locations and have different widths. After standardization both are centered at 0 with a standard deviation of 1, so they can be compared on a common scale.