Rodolfo Mendes

Originally published at reinforcementlearning4.fun

How to scale attributes with normalization and standardization

Attributes on different scales are common in Machine Learning projects. For example, a medical record dataset can include columns for weight, height, and blood pressure. These attributes have different units of measure and vary over different intervals, which makes them difficult to compare.

In these cases, we can apply a process called scaling to make this comparison easier. In this process, we change the original values while keeping the relative distances between the data points, so that the shape of the attribute's distribution is preserved.

Normalization

In normalization, we scale an attribute by making all data points fit in the interval between 0.0 and 1.0. We express the normalization process using the formula:

$$ X^{norm} = \frac{X - X_{min}}{x_{max} - x_{min}} $$

where:

  • $X^{norm}$: is the new scaled attribute
  • $X_{min}$: is a column vector where all elements are equal to $x_{min}$
  • $x_{min}$: is the minimum value of the $X$ attribute
  • $x_{max}$: is the maximum value of the $X$ attribute

For each attribute value, we subtract the attribute's minimum and then divide the result by the difference between its maximum and its minimum. The mean and standard deviation are rescaled as well, but the transformation preserves the shape of the data distribution.
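The normalization formula can be sketched in plain Python, without any library, to make each step visible. The function name `min_max_normalize` and the sample heights are hypothetical and chosen only for illustration. The guard for a constant attribute is an assumption about how one might handle the division by zero, not something the original project describes.

```python
def min_max_normalize(values):
    """Scale a list of numbers to the interval [0.0, 1.0]."""
    x_min = min(values)
    x_max = max(values)
    span = x_max - x_min
    if span == 0:
        # All values are identical, so the formula would divide by zero.
        raise ValueError("cannot normalize a constant attribute")
    # Subtract the minimum, then divide by (maximum - minimum).
    return [(x - x_min) / span for x in values]


# Hypothetical heights in centimeters.
heights = [150.0, 160.0, 175.0, 200.0]
heights_norm = min_max_normalize(heights)
print(heights_norm)  # [0.0, 0.2, 0.5, 1.0]
```

The smallest value maps to 0.0 and the largest to 1.0, while the gaps between points keep their proportions: 175 sits halfway between 150 and 200, and its normalized value is 0.5.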

Standardization

In standardization, the attribute is transformed to have a mean equal to 0 (zero) and a standard deviation equal to 1 (one). The following formula is applied:

$$ X^{std} = \frac{X - \mu}{\sigma} $$

Where:

  • $X^{std}$: is the new scaled attribute
  • $X$: is a column vector representing our attribute
  • $\mu$: is a column vector where all elements are the mean value of the attribute
  • $\sigma$: is the standard deviation of the attribute

In standardization, there are no lower and upper limits for the new data values. Instead, each value is now expressed as its distance from the mean, measured in units of standard deviation.
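As a minimal sketch of the standardization formula, the snippet below uses only Python's `statistics` module. The function name `standardize` and the sample weights are hypothetical. It uses the sample standard deviation (`statistics.stdev`, which divides by n - 1) because that matches the default of pandas' `Series.std()`; using the population standard deviation instead would give slightly different values.

```python
import statistics


def standardize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mu = statistics.mean(values)
    # Sample standard deviation (n - 1 in the denominator).
    sigma = statistics.stdev(values)
    if sigma == 0:
        # A constant attribute has no spread to rescale.
        raise ValueError("cannot standardize a constant attribute")
    return [(x - mu) / sigma for x in values]


# Hypothetical weights in kilograms.
weights = [60.0, 70.0, 80.0, 90.0]
weights_std = standardize(weights)

print(weights_std)
print(statistics.mean(weights_std))   # close to 0
print(statistics.stdev(weights_std))  # close to 1
```

Values below the mean become negative and values above it become positive, and nothing forces them into a fixed interval.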

Example

In the project Scaling attributes with normalization and standardization, we use vectorized operations to apply normalization and standardization to scale the attributes of the House Data Pricing dataset.

For example, we used the following code to apply normalization to the LotFrontage attribute:

df_norm['LotFrontage'] = (df_float['LotFrontage'] - df_float['LotFrontage'].min()) / (df_float['LotFrontage'].max() - df_float['LotFrontage'].min())

We also applied standardization to the same attributes and saved the result in a separate DataFrame:

df_std['LotFrontage'] = (df_float['LotFrontage'] - df_float['LotFrontage'].mean()) / df_float['LotFrontage'].std()

After scaling the attributes, we created a linear regression model for each dataset and compared the results.

[Figure: Real data vs linear regression models]

Conclusion

After comparing the scores of the models, we concluded that scaling the attributes does not, by itself, improve the linear regression models. Check the complete example in the link below:

Scaling attributes with normalization and standardization
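The conclusion is what we would expect for ordinary least squares with an intercept: normalization and standardization are linear transformations of the attribute, so the fitted line makes the same predictions and only the coefficients change. The sketch below illustrates this on a single attribute. It assumes the score being compared is R², and the frontage and price values are made-up toy data, not taken from the House Data Pricing dataset.

```python
import statistics


def r_squared(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return R^2."""
    x_mean = statistics.mean(xs)
    y_mean = statistics.mean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: lot frontage and sale price.
frontage = [50.0, 60.0, 65.0, 80.0, 100.0]
price = [120.0, 150.0, 148.0, 200.0, 230.0]

# Normalized copy of the attribute.
x_min, x_max = min(frontage), max(frontage)
frontage_norm = [(x - x_min) / (x_max - x_min) for x in frontage]

# Standardized copy of the attribute.
mu, sigma = statistics.mean(frontage), statistics.stdev(frontage)
frontage_std = [(x - mu) / sigma for x in frontage]

r2_raw = r_squared(frontage, price)
r2_norm = r_squared(frontage_norm, price)
r2_std = r_squared(frontage_std, price)

# The three scores agree up to floating-point rounding.
print(r2_raw, r2_norm, r2_std)
```

Scaling can still matter for models that are sensitive to attribute magnitudes, such as regularized or distance-based models, but that is outside what this example compares.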

Top comments (5)

Alexandro Disla

You should have done this

$$ X^{std} = \frac{X - \mu}{\sigma} $$

for a one line equation.

and $ X^{std} = \frac{X - \mu}{\sigma} $ for an inline equation

Alexandro Disla

Probably dev.to doesn't support proper markdown and latex syntax

Alexandro Disla

Nice work. You can use other kinds of transformations on your data as well.

if x is your dataset

  1. log transform: x = log(x)
  2. square root transform: x = sqrt(x)
 
Alexandro Disla

That’s curious

Rodolfo Mendes

Thanks @lukaszahradnik !