Ever wonder how you can transform your data from an exponential or skewed distribution to a normal distribution? In this article, I will discuss why we use logarithmic transformation within a dataset, and how it is used to get better predictions from a linear regression model. This model can be represented by the following equation:
$$Y = \beta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$
- $Y$ is the predicted value
- $\beta_0$ is the y-intercept
- $\theta_1, \dots, \theta_n$ are the model parameters
- $x_1, x_2, \dots, x_n$ are the feature values
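As a minimal sketch of how the equation above turns feature values into a prediction, the snippet below adds the intercept to the weighted sum of the features. The function name `predict` and all numbers are made up for illustration; they do not come from the housing data used later.

```python
import numpy as np

def predict(intercept, coefficients, features):
    """Return intercept + sum(coefficient_i * feature_i) for each row of features."""
    features = np.asarray(features, dtype=float)
    coefficients = np.asarray(coefficients, dtype=float)
    return intercept + features @ coefficients

# Hypothetical parameters: an intercept and one weight per feature
intercept = 50.0
coefficients = [10.0, 0.5]

# Two made-up observations, each with two feature values
features = [[2.0, 100.0],
            [3.0, 200.0]]

predictions = predict(intercept, coefficients, features)
print(predictions)
```

Each prediction is a straight weighted sum, which is why a one-unit change in a feature always moves the prediction by the same amount.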
Some properties of logarithms and exponential functions that you may find useful include:
- $\log(e) = 1$
- $\log(1) = 0$
- $\log(x^r) = r \log(x)$
- $\log(e^A) = A$
- $e^{\log A} = A$
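The properties listed above can be checked numerically with numpy, where `np.log` is the natural log. This is only a quick sanity check; the values of `x`, `r`, and `a` are arbitrary positive numbers chosen for illustration.

```python
import numpy as np

x, r, a = 3.0, 4.0, 2.5  # arbitrary positive values, chosen only for illustration

checks = {
    "log(e) = 1": np.isclose(np.log(np.e), 1.0),
    "log(1) = 0": np.isclose(np.log(1.0), 0.0),
    "log(x^r) = r log(x)": np.isclose(np.log(x ** r), r * np.log(x)),
    "log(e^A) = A": np.isclose(np.log(np.exp(a)), a),
    "e^(log A) = A": np.isclose(np.exp(np.log(a)), a),
}

for name, holds in checks.items():
    print(f"{name}: {holds}")
```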
A regression model works in unit changes between the x and y variables, where a single unit change in x coincides with a constant change in y. Taking the log of one or both variables effectively changes the interpretation from a unit change to a percent change. This is especially important when using medium to large datasets. Another way to think about it: taking the log of a variable transforms it so that you can take better advantage of statistical tools such as linear regression, which tend to perform better on features that are roughly normally distributed.
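A logarithm is the exponent to which a base must be raised to produce a given positive number. For example, the base-10 log of 100 is 2, because 10^2 = 100. The natural log uses the base e in the same way, so the natural log function and the exponential function (e^x) are inverses of each other.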
Key note: a 0.1 unit change in log(x) is approximately equivalent to a 10% increase in x.
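To see how close this rule of thumb is, the exact percent change implied by a change in log(x) can be computed with the exponential function. The helper name `pct_change_from_log_diff` is my own, used here only as a sketch.

```python
import numpy as np

def pct_change_from_log_diff(delta):
    """Exact percent change in x implied by a change of `delta` in log(x)."""
    return (np.exp(delta) - 1.0) * 100.0

for delta in (0.01, 0.1, 0.5):
    print(f"change in log(x) = {delta}: x changes by {pct_change_from_log_diff(delta):.2f}%")
```

The rule of thumb is tight for small changes and drifts as the change in log(x) grows, so it is best read as an approximation.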
Logarithmic transformation is a convenient means of transforming a highly skewed variable into a more normally distributed one. When modeling variables with non-linear relationships, the prediction errors may also end up skewed. In theory, we want to produce the smallest error possible when making a prediction, while also making sure that we are not overfitting the model. Overfitting occurs when there are so many independent variables in play that the model fits the training data too closely and does not generalize well enough to make valid predictions on new data. Using the logarithm of one or more variables can improve the fit of the model by transforming the distribution of the features to a more normally-shaped bell curve.
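A rough way to see this effect is to measure skewness before and after taking the log. The sketch below uses synthetic, right-skewed values generated from a log-normal distribution as a stand-in for real prices (an assumption, not the housing data), and a small hand-written `skewness` helper.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean cubed z-score (close to 0 for a symmetric distribution)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices"; a stand-in for real data, not the King County file
prices = rng.lognormal(mean=13.0, sigma=0.5, size=10_000)

raw_skew = skewness(prices)
log_skew = skewness(np.log(prices))
print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```

Because the synthetic values are log-normal by construction, their log is close to symmetric; real data will rarely be this tidy.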
In Python, let’s first import the necessary libraries that will be used to show how this all works. I have also loaded King County (Washington state) housing data into a dataframe. In the following 2 histograms, you can see the difference between the actual price of houses and the log price of houses, computed with numpy.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('kc_house_data.csv')
df.hist('price', figsize=(8, 5))
plt.title('Number of houses vs Price')
plt.ylabel('Number of Houses')
plt.xlabel('Price')

df['log_price'] = np.log(df['price'])
df.hist('log_price', figsize=(8, 5))
plt.title('Number of houses vs log(Price)')
plt.ylabel('Number of Houses')
plt.xlabel('log(Price)')
First, determine the target parameter and the independent variables to be used, since these indicate whether a log transformation is necessary. For example, suppose our business case is presenting to Redfin, a real estate company, what the listing price of a home with certain features should be. For this model, we will use 2 features: the number of bathrooms (independent, x) and living area square-footage (independent, x) vs. housing prices (dependent, y, our target parameter).
If we graph a histogram for number of bathrooms, the data has a relatively normal distribution, as most houses have between 1 and 4 bathrooms, with a higher number of houses having 2 bathrooms. However, if we graph a histogram of living area sqft, we get a highly skewed distribution, similar to the price histogram in the first plot above. That plot is basically stating that a high number of homes are similar in price, with only a few in the multi-million-dollar range. In our example, taking the log of price and sqft of living rescales both variables so that they follow something much closer to a normal distribution curve. This effectively puts the data on a natural logarithm scale.
Now, we import a library called statsmodels. From there, we want Ordinary Least Squares (OLS) regression, the standard method for fitting a linear regression model. The images below show the relationship between sqft of living and price. Figure 1 illustrates 4 graphs of similar metrics at a per-unit scale, using the un-logged independent and dependent variables. You will notice a cone shape, where the data points scatter more widely as square-footage increases. Figure 2 shows the changes when a log transformation is applied, and we can now see the relationship as a percent change. With the logarithm applied to your variables, the regression line runs much more cleanly through the bulk of the data points, resulting in a better prediction model.
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit price against living area on the original (un-logged) scale
f = 'price~sqft_living'
model = ols(formula=f, data=df).fit()
fig = plt.figure(figsize=(15, 8))
fig = sm.graphics.plot_regress_exog(model, 'sqft_living', fig=fig)

# Log-transform living area (log_price was created earlier), then refit
df['log_sqft_living'] = np.log(df['sqft_living'])
f = 'log_price~log_sqft_living'
model_log = ols(formula=f, data=df).fit()
fig = plt.figure(figsize=(15, 8))
fig = sm.graphics.plot_regress_exog(model_log, 'log_sqft_living', fig=fig)
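The cone shape described above can also be checked numerically. The sketch below is not the author's method: it fits a simple least-squares line with `np.polyfit` and compares the spread of the residuals for large versus small x. The data is synthetic, generated with multiplicative noise as an assumption, and `residual_spread_ratio` is a hypothetical helper.

```python
import numpy as np

def residual_spread_ratio(x, y):
    """Fit a least-squares line, then compare residual spread for large vs small x.

    Returns std(residuals where x > median) / std(residuals where x <= median).
    A ratio well above 1 indicates the fanning-out ("cone") pattern.
    """
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    upper = x > np.median(x)
    return float(residuals[upper].std() / residuals[~upper].std())

rng = np.random.default_rng(1)
# Synthetic stand-ins for living area and price, with multiplicative noise (an assumption)
sqft = rng.lognormal(mean=7.5, sigma=0.4, size=5_000)
price = 150.0 * sqft ** 1.1 * rng.lognormal(mean=0.0, sigma=0.4, size=5_000)

ratio_raw = residual_spread_ratio(sqft, price)
ratio_log = residual_spread_ratio(np.log(sqft), np.log(price))
print(f"raw scale: {ratio_raw:.2f}, log scale: {ratio_log:.2f}")
```

On data like this, the raw-scale ratio sits well above 1 while the log-scale ratio stays near 1, which is the numeric counterpart of the cone disappearing in the second figure.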
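Once the data has been modeled, we can print a summary of some useful metrics by calling the summary() method on each fitted model, for example print(model.summary()) and print(model_log.summary()).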
I extracted a few values from the table for reference. R-squared is the proportion of the variation in the response variable that is explained by a linear model. So, if the R² of a model is 0.50, then approximately half of the observed variation can be explained by the model's inputs. A very high R-squared can be a sign that the model is overfit, so it should not be judged in isolation. In this case, we have a slightly better R-squared when we do a log transformation, which is a positive sign!
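To make the definition of R-squared concrete, here is a small sketch that computes it by hand as 1 minus the ratio of residual variation to total variation. It uses `np.polyfit` on synthetic data (an assumption, not the housing file), and the helper name `r_squared` is my own. Whether the logged fit scores higher depends on the data, so no particular ordering is claimed here.

```python
import numpy as np

def r_squared(x, y):
    """R-squared of a one-feature least-squares line fit of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)          # variation left unexplained by the line
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in y
    return float(1.0 - ss_res / ss_tot)

rng = np.random.default_rng(42)
# Synthetic data in which price depends on sqft multiplicatively (an assumption)
sqft = rng.lognormal(mean=7.5, sigma=0.4, size=5_000)
price = 150.0 * sqft ** 1.1 * rng.lognormal(mean=0.0, sigma=0.4, size=5_000)

r2_raw = r_squared(sqft, price)
r2_log = r_squared(np.log(sqft), np.log(price))
print(f"R-squared, raw variables:    {r2_raw:.3f}")
print(f"R-squared, logged variables: {r2_log:.3f}")
```

One caveat: the two numbers describe different targets (price versus log price), so they are a rough guide and not a strict like-for-like comparison.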
To put our results into a business case, let's do the following:
y = 312.681 * np.log(1.1) = 29.80
y = 312.681 * 0.095 = 29.80
"Approximately every 10% increase in sqft of living space will result in an increase of $29.80 in house value."
As the above statement is a bit hard to conceptualize, we can instead multiply the coefficient by log(2). Note that log(2) corresponds to doubling the living space (a 100% increase), not to adding a single square foot.
y = 312.681 * np.log(2) = 216.73
"Approximately every doubling of living space will increase house value by $216.73."
Taking the log of a variable puts it on a natural-log scale. Inversely, to express the outputs of the logged model in the original units, we would just apply the exponential function (np.exp()) to the logged values or coefficient(s).
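The business-case arithmetic and the inverse step can be sketched together as follows. The coefficient 312.681 is the value quoted in the text; the helper `change_for_pct_increase` and the example prices are made up for illustration. The multiplier log(1 + p) corresponds to a proportional increase p in x, so log(1.1) is a 10% increase and log(2) is a 100% increase.

```python
import numpy as np

coefficient = 312.681  # coefficient quoted in the text

def change_for_pct_increase(coefficient, pct_increase):
    """Change in y when x grows by pct_increase percent and y is modeled on log(x)."""
    return coefficient * np.log(1.0 + pct_increase / 100.0)

ten_percent = change_for_pct_increase(coefficient, 10)    # multiplier is log(1.1)
doubling = change_for_pct_increase(coefficient, 100)      # multiplier is log(2)
print(f"10% increase in x:  {ten_percent:.2f}")
print(f"100% increase in x: {doubling:.2f}")

# np.exp undoes np.log, so logged values can be mapped back to original units
prices = np.array([250_000.0, 400_000.0, 1_200_000.0])    # made-up prices
recovered = np.exp(np.log(prices))
print(recovered)
```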
When looking at prediction accuracy and minimizing errors in your project, such as mean absolute error (MAE), mean squared error (MSE), or root mean squared error (RMSE), always check to see if your data is skewed in any way. If it is, log it!
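For reference, the three error metrics mentioned above can be computed in a few lines of numpy. This is a minimal sketch with made-up prices and predictions; `regression_errors` is a hypothetical helper name.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MAE, MSE, and RMSE for paired true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = float(np.mean(errors ** 2))
    return {
        "mae": float(np.mean(np.abs(errors))),  # average absolute miss
        "mse": mse,                             # average squared miss
        "rmse": float(np.sqrt(mse)),            # square root of MSE, in the units of y
    }

# Made-up house prices and predictions, for illustration only
y_true = [300_000, 450_000, 600_000]
y_pred = [310_000, 430_000, 650_000]

metrics = regression_errors(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:,.2f}")
```

If the model was trained on a logged target, apply np.exp() to the predictions before computing these metrics so that the errors are reported in the original units.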