DEV Community

Anna Zubova

King County House Sales dataset: log-transformations and interpreting results

Log transformations turn out to be very useful when working with linear regression models. They help correct skewness in the data and bring the distribution closer to normal.

Alexander Bailey and I worked on a dataset of housing prices in King County in order to develop a linear regression model that explains price variations.

We tried log transformations on several continuous variables, but we got the best results by applying them to the set of distances from major central locations: downtown Seattle, downtown Bellevue, and South Lake Union. The distance to each of these locations was expressed in miles.

Let’s look at one of the variables: distance in miles from downtown Seattle. This is how the distribution looked before and after the log transformation:

Distance from Seattle downtown distribution

We can see that the log transformation corrected the skewness and made the distribution more normal, which was beneficial for linear regression model performance. By taking the natural logarithm of the parameter values, we were able to improve our model’s metrics in one of the model-fitting iterations: R squared rose from 0.627 to 0.686 and MAE fell from 134016.02 to 131342.48.
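To see the effect in isolation, here is a minimal sketch on synthetic data, not the King County dataset. The log-normal sample, its parameters, and the `skewness` helper are assumptions made for illustration; it uses only the standard library so it runs without the data.

```python
import math
import random


def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed "distance" values (assumed log-normal, seeded for repeatability).
rng = random.Random(0)
distances = [rng.lognormvariate(2.0, 0.6) for _ in range(5000)]
log_distances = [math.log(d) for d in distances]

raw_skew = skewness(distances)
log_skew = skewness(log_distances)
print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```

A skewness close to zero after the transformation indicates a roughly symmetric distribution, which is what the histograms show for the real distance variables.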

In the case of the distance from downtown Bellevue, the distribution was also significantly improved by taking the natural logarithm of the variable values:

Distance from Bellevue downtown distribution

Change in distribution after log-transformation of the distance from South Lake Union parameter:

Distance from South Lake Union distribution

Log-transformed distance parameters in the linear regression model

Our final model for the housing prices focuses on making predictions for the prices up to approximately $1.1 million. Distance from downtown Seattle, downtown Bellevue, and South Lake Union are among the strongest predictors of the price.

Here are the coefficients for the distance values in our model:

Coefficient name            Value
'mi_from_downtown_log'      210431.83668516215
'mi_from_bellevue_log'     -109855.7636174455
'mi_from_south_lake_log'   -294919.7693134867

The general reading of a coefficient is that a one-unit increase in the 'mi_from_downtown_log' variable, assuming the other variables in the model remain constant, is associated with an increase in the housing price of $210,431.84 on average. The difficulty is that the explanatory variable was log-transformed: one unit of the natural log is not one mile, it corresponds to multiplying the distance by e (about 2.72).

Distance from downtown parameter

Let’s simulate how the increase of the distance in miles from downtown Seattle would affect housing prices, assuming that other variables are constant.

First, let’s generate an array of numbers from 1 to 100, representing an increase of distance from downtown Seattle. The next step is to take the natural log of these numbers, which gives an array ranging from 0 to about 4.6. We then calculate y values (that is, our predicted price contribution) for each of the values of log(x) using the coefficient of 210,431.84 from our linear regression model.

import numpy as np

miles_range = np.arange(1, 101)  # distances of 1 to 100 miles
miles_range_log = np.log(miles_range)  # natural logarithm
y = 210431.83668516215 * miles_range_log

Here is the plot representing the relationship between y and log(x) compared to the relationship between y and x:

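import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 1, constrained_layout=True)

# Top panel: price against the log-transformed distance (a straight line)
axs[0].plot(miles_range_log, y)
axs[0].set_title('Log transformed variable')
axs[0].set_xlabel('log(mi_from_downtown)')
axs[0].set_ylabel('price')

# Bottom panel: the same prices against the raw distance in miles
axs[1].plot(miles_range, y)
axs[1].set_title('Raw data')
axs[1].set_xlabel('mi_from_downtown')
axs[1].set_ylabel('price');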

raw vs log-transformed data

We can see that log-transforming this variable converted the relationship from logarithmic (a curve that rises steeply at first and then flattens) to linear, which serves the purpose of improving the metrics of the linear regression.

Interestingly, in the case of the distance from downtown Seattle, the coefficient is positive, which is the opposite of what we had expected. However, the coefficients for the distance from downtown Bellevue and South Lake Union are negative. The explanation might be that people would prefer to be further away from downtown Seattle in favor of proximity to other points.

Distance from downtown Bellevue parameter

According to our model the coefficient is -109855.76.

Let’s build an array of y values:

y_bellevue = -109855.76 * miles_range_log

Plotting the log(x) and x vs y:

fig, axs = plt.subplots(2, 1, constrained_layout=True)

axs[0].plot(miles_range_log, y_bellevue)
axs[0].set_title('Log transformed variable')
axs[0].set_xlabel('log(mi_from_bellevue)')
axs[0].set_ylabel('price')

axs[1].plot(miles_range, y_bellevue)
axs[1].set_title('Raw data')
axs[1].set_xlabel('mi_from_bellevue')
axs[1].set_ylabel('price');

raw vs log-transformed data

Here the relationship is more logical: the further away a house is from Bellevue, the lower its price.

Distance from South Lake Union parameter

The coefficient for this parameter is -294919.77, which is the largest in magnitude among all three distance parameters.

Calculating an array with y values:

y_s_lake_union = -294919.7693134867 * miles_range_log

Visualization of the x and log(x) vs y:

raw vs log-transformed data

Interpreting the results

When log transformations are applied to a dataset, it can be difficult to explain to a non-technical audience how the model works. Here is how it can be done using our model as an example.

Let’s look at the distance from South Lake Union, as it has the largest coefficient of the three distance parameters. The coefficient for this variable is -294919.77, which in the case of a non-log-transformed variable would mean that an increase in distance from South Lake Union by 1 mile would be associated with a $294,919.77 decrease in price on average.

However, the values of the distance are log-transformed, so we can’t use this 1-unit-increase technique. With a log-transformed predictor, we talk about percentage changes in the distance instead.

To find out what the increase in target price would be, let’s look at the equation:

price(x1) - price(x0) = coef * log(x1) - coef*log(x0) = coef * (log(x1) - log(x0)) = coef * log(x1/x0)

So in our case to find out what would be the change of price resulting from an increase of the distance by 10%, we will have to calculate the following:

change_in_price = -294,919.77 * log(1.1)

change_in_price = -294,919.77 * 0.0953102 ≈ -28,108.86

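Based on the above calculation, each 10% increase in the distance from South Lake Union is associated with a price decrease of about $28,109 on average. The 10% is relative to the current distance: going from 1 mile to 1.1 miles (roughly 530 ft further) has the same predicted effect as going from 5 miles to 5.5 miles.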
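As a rough check of this arithmetic, here is a minimal sketch of the equation above. The `price_change` helper is a hypothetical name introduced for illustration; the coefficient is the one reported for 'mi_from_south_lake_log', and the other variables are assumed to stay fixed.

```python
import math

# Coefficient of 'mi_from_south_lake_log' from the model above.
COEF_SOUTH_LAKE = -294919.7693134867


def price_change(coef, x0, x1):
    """Predicted price change when the distance goes from x0 to x1 miles,
    holding everything else constant: coef * log(x1 / x0)."""
    return coef * math.log(x1 / x0)


# A 10% increase in distance, measured from two different starting points.
change_from_1_mile = price_change(COEF_SOUTH_LAKE, 1.0, 1.1)
change_from_5_miles = price_change(COEF_SOUTH_LAKE, 5.0, 5.5)

print(f"1.0 -> 1.1 miles: {change_from_1_mile:,.2f}")
print(f"5.0 -> 5.5 miles: {change_from_5_miles:,.2f}")
```

Both calls give the same result because only the ratio x1/x0 enters the formula, which is why a log-transformed predictor is read in terms of percentage changes.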

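The argument of the log is the ratio of the new distance to the old one, so log(2) corresponds to doubling the distance. For example, to find the effect of moving from 1 mile to 2 miles away from South Lake Union, we would do the following calculation:

change_in_price = -294,919.77 * log(2) ≈ -204,422.81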
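Because the model depends on the ratio of the new distance to the old one, the effect of one extra mile depends on the starting distance; the log(2) figure is the case of going from 1 mile to 2 miles. The sketch below is an illustration under that reading. The `one_mile_effect` helper and the chosen starting distances are assumptions, not part of the original analysis.

```python
import math

# Coefficient of 'mi_from_south_lake_log' from the model above.
COEF_SOUTH_LAKE = -294919.7693134867


def one_mile_effect(coef, start_miles):
    """Predicted price change for moving one mile further out from start_miles,
    holding everything else constant."""
    return coef * math.log((start_miles + 1.0) / start_miles)


# Hypothetical starting distances, in miles.
effects = {start: one_mile_effect(COEF_SOUTH_LAKE, start) for start in (1, 2, 5, 10)}

for start, change in effects.items():
    print(f"{start:>2} -> {start + 1:>2} miles: {change:,.0f}")
```

The predicted drop shrinks as the starting distance grows, which matches the flattening curve in the raw-data plot.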

As a side note, it is helpful to know that for small percentage changes in x (up to about 5%), increasing x by a given percentage is almost equivalent to adding that fraction to log(x). For example, increasing x by 5% adds about 0.05 to log(x), and increasing x by 2% adds about 0.02.
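A quick way to see how far this rule of thumb holds is to compare log(1 + p) with p directly. This is a small sketch; the list of percentages is an arbitrary choice for illustration.

```python
import math

# Fractional increases in x to compare (1%, 2%, 5%, 10%, 50%).
fractions = [0.01, 0.02, 0.05, 0.10, 0.50]

# For each fraction p, the exact change in log(x) is log(1 + p).
exact_changes = {p: math.log(1.0 + p) for p in fractions}

for p, exact in exact_changes.items():
    print(f"x up by {p:>4.0%}: log(x) changes by {exact:.4f} (rule of thumb: {p:.4f})")
```

The two columns agree closely for the small increases and drift apart for the larger ones, so the shortcut should be reserved for changes of a few percent.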
