Ricky

Posted on Oct 20, 2023

Using machine learning to predict the selling price of a property

#machinelearning #ai #datascience #python

Regression

Regression is a fundamental machine learning algorithm used for modeling the relationship between a dependent variable (target) and one or more independent variables (features).

Let's build a regression machine learning model and use it to predict the selling price of a property using sales data from Connecticut.

We-ll use scikit-learn’s RandomForestRegressor to train the model and evaluate it on a test set

The overall objective is to predict the sale amount of property using 5 feature variables and 1 target variable:

See this repo for the Jupyter notebook

Feature Variables

List Year: The year the property was listed
Town: The Town in which the property is
Assessed Value: Reported assessed value
Sales Ratio: The ratio of the sale value to the assessed value
Property Type: The type of property

Target Variable

Sale Amount: The reported sale amount.

Dependencies

pandas - To work with solid data-structures, n-dimensional matrices and perform exploratory data analysis.
matplotlib - To visualize data using 2D plots.
scikit-learn - To create machine learning models easily and make predictions.
numpy - To work with arrays.

Install dependencies, obtain & load data

%pip install -U pandas matplotlib numpy scikit-learn

import pandas as pd
import seaborn as sns 

import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np

# Read in the data with read_csv() into a Pandas Dataframe
data_path = '/path_to_csv_file/data.csv'
sales_df = pd.read_csv(data_path)

# Use .info() to show the features (i.e. columns) in the dataset along with a count and datatype
sales_df.info()

/var/folders/21/_103sf7x3zgg4yk3syyg448h0000gn/T/ipykernel_60538/748677588.py:3: DtypeWarning: Columns (8,9,10,11,12) have mixed types. Specify dtype option on import or set low_memory=False.
  sales_df = pd.read_csv(data_path)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 997213 entries, 0 to 997212
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Serial Number     997213 non-null  int64  
 1   List Year         997213 non-null  int64  
 2   Date Recorded     997211 non-null  object 
 3   Town              997213 non-null  object 
 4   Address           997162 non-null  object 
 5   Assessed Value    997213 non-null  float64
 6   Sale Amount       997213 non-null  float64
 7   Sales Ratio       997213 non-null  float64
 8   Property Type     614767 non-null  object 
 9   Residential Type  608904 non-null  object 
 10  Non Use Code      289681 non-null  object 
 11  Assessor Remarks  149864 non-null  object 
 12  OPM remarks       9934 non-null    object 
 13  Location          197697 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 106.5+ MB

Data wrangling

The raw data has more data points than we're interested in and the data has some missing values.

We need to clean up the data to a form the model can understand

# Get the counts of the null values in each column
sales_df.isnull().sum()

Serial Number            0
List Year                0
Date Recorded            2
Town                     0
Address                 51
Assessed Value           0
Sale Amount              0
Sales Ratio              0
Property Type       382446
Residential Type    388309
Non Use Code        707532
Assessor Remarks    847349
OPM remarks         987279
Location            799516
dtype: int64

We're going to begin by dropping all the columns we don't need for this problem.
This includes: ['Serial Number','Date Recorded', 'Address', 'Residential Type', 'Non Use Code', 'Assessor Remarks', 'OPM remarks', 'Location']
Moreover, a lot of these fields have null values.
We'll keep 'Property Type' but we'll drop all null values and later on we'll see how to handle such data

# drop the unnecessary columns
columns_to_drop = ['Serial Number','Date Recorded', 'Address', 'Residential Type', 'Non Use Code', 'Assessor Remarks', 'OPM remarks', 'Location']

sales_df = sales_df.drop(columns=columns_to_drop)
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 997213 entries, 0 to 997212
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   List Year       997213 non-null  int64  
 1   Town            997213 non-null  object 
 2   Assessed Value  997213 non-null  float64
 3   Sale Amount     997213 non-null  float64
 4   Sales Ratio     997213 non-null  float64
 5   Property Type   614767 non-null  object 
dtypes: float64(3), int64(1), object(2)
memory usage: 45.6+ MB

# remove all rows with null Property Type
sales_df = sales_df.dropna(subset=['Property Type'])
sales_df.isnull().sum()

List Year         0
Town              0
Assessed Value    0
Sale Amount       0
Sales Ratio       0
Property Type     0
dtype: int64

Great! now all rows and all colums have data. lets see the shape of the data

# print shape of dataframe
sales_df.shape

(614767, 6)

# as a python convention, lets rename the column names to be in snake_case
sales_df = sales_df.rename(columns=lambda x: x.lower().replace(' ', '_'))
sales_df.columns

Index(['list_year', 'town', 'assessed_value', 'sale_amount', 'sales_ratio',
       'property_type'],
      dtype='object')

Visualizations

Histograms

# let's plot a histogram for the all the features to understand the data distributions
# from the graph we can clearly see there are outliers in the data
# this can skew the accuracy of the model
sales_df.plot(figsize=(20,15))

Clean data to remove outliers

We can set a few conditions on the data in order to remove outliers

Filter sale amount and assessed value to values less than 100M
Filter sale amount to be greater than 2000

We'll then compare the performance of two models trained using the cleaned and the raw data with outliers

sales_df_clean = sales_df[(sales_df['sale_amount'] < 1e8) & (sales_df['assessed_value'] < 1e8) & (sales_df['sale_amount'] > 2000)]

# plot the data again to see the difference
sales_df_clean.plot(figsize=(20,15))

(614767, 6)

Encode categorical data

We nead to encode the categorical data (town and property type) into numerical data using one-hot encoding


sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 614767 entries, 0 to 997211
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   list_year       614767 non-null  int64  
 1   town            614767 non-null  object 
 2   assessed_value  614767 non-null  float64
 3   sale_amount     614767 non-null  float64
 4   sales_ratio     614767 non-null  float64
 5   property_type   614767 non-null  object 
dtypes: float64(3), int64(1), object(2)
memory usage: 32.8+ MB

# let's count the number of unique categories for town and property type
sales_df["property_type"].value_counts()

property_type
Single Family     401612
Condo             105420
Residential        60728
Two Family         26408
Three Family       12586
Vacant Land         3163
Four Family         2150
Commercial          1981
Apartments           486
Industrial           228
Public Utility         5
Name: count, dtype: int64

# town
sales_df["town"].value_counts()

town
Bridgeport       18264
Waterbury        17938
Stamford         17374
Norwalk          14637
Greenwich        12329
                 ...  
Eastford           266
Scotland           258
Canaan             250
Union              121
***Unknown***        1
Name: count, Length: 170, dtype: int64

One-Hot Encoding

One-hot encoding is a technique used in machine learning and data preprocessing to convert categorical data into a binary format that can be fed into machine learning algorithms. Each category or label in a categorical variable is represented as a binary vector, where only one element in the vector is "hot" (set to 1) to indicate the category, while all others are "cold" (set to 0). This allows machine learning algorithms to work with categorical data, which typically requires numerical inputs. One-hot encoding is essential when working with categorical variables in tasks like classification and regression, as it ensures that the model can interpret and make predictions based on these variables.

# let's replace the town and property columns using pandas' get_dummies()
sales_df_encoded = pd.get_dummies(data=sales_df, columns=['property_type', 'town'])
sales_df_clean_encoded = pd.get_dummies(data=sales_df_clean, columns=['property_type', 'town'])

	list_year	assessed_value	sale_amount	sales_ratio	property_type_Apartments	property_type_Commercial	property_type_Condo	property_type_Four Family	property_type_Industrial	property_type_Public Utility	...	town_Willington	town_Wilton	town_Winchester	town_Windham	town_Windsor	town_Windsor Locks	town_Wolcott	town_Woodbridge	town_Woodbury	town_Woodstock
0	2020	150500.0	325000.0	0.4630	False	True	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
1	2020	253000.0	430000.0	0.5883	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
2	2020	130400.0	179900.0	0.7248	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
3	2020	619290.0	890000.0	0.6958	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
4	2020	862330.0	1447500.0	0.5957	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False

5 rows × 185 columns

Train the models

import sklearn
from sklearn.model_selection import train_test_split

# Split target variable and feature variables
y = sales_df_encoded['sale_amount']
X = sales_df_encoded.drop(columns=['sale_amount'])
y_cleaned = sales_df_clean_encoded['sale_amount']
X_cleaned = sales_df_clean_encoded.drop(columns=['sale_amount'])

Split training & test data

We need to split our data into training data and test data in an 80:20 or 70:30 ratio depending on the use case.

The reasoning is that we'll use the training data to train the model and the test data to validate it. The model needs to perform well with test data that it has never seen

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, shuffle=True, test_size=0.2)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_cleaned, y_cleaned, random_state=42, shuffle=True, test_size=0.2)

# print the shapes of the new X objects
print(X_train.shape)
print(X_train_c.shape)

(491813, 184)
(491595, 184)

RandomForestRegressor

In scikit-learn, the RandomForestRegressor is a machine learning model that is used for regression tasks. It is an ensemble learning method that combines the predictions of multiple decision trees to make more accurate and robust predictions. When creating a RandomForestRegressor, you can specify several parameters to configure its behavior. Here are some of the common parameters passed to the RandomForestRegressor:

n_estimators: The number of decision trees in the forest. Increasing the number of trees can improve the model's performance, but it also increases computation time.
max_depth: The maximum depth of each decision tree in the forest. It limits the tree's growth and helps prevent overfitting.
random_state: A random seed for reproducibility. If you set this value, you'll get the same results each time you run the model with the same data.

These are some of the commonly used parameters, but the RandomForestRegressor has many more options for fine-tuning your model. You can choose the appropriate values for these parameters based on your specific regression task and dataset.

Random Forest is a non-parametric model, meaning it does not make strong assumptions about the underlying data distribution. It is highly flexible and can capture complex relationships in the data.

Overfitting

This is the tendency of a model to be very sensitive to outliers. Such a model performs very well on training data but poorly on unseen data

Underfitting

Underfitting occurs when a model is too simple or lacks the capacity to capture the underlying patterns in the data. Such a model generalizes too much and results in inaccurate results.

from sklearn.ensemble import RandomForestRegressor

# Create a  regressor using all the feature variables
# we will tweak the parameters later to improve the performance of the model
rf_model = RandomForestRegressor(n_estimators=10,random_state=10)

rf_model_clean = RandomForestRegressor(n_estimators=10,random_state=10)
# Train the models using the training sets
rf_model.fit(X_train, y_train)
rf_model_clean.fit(X_train_c, y_train_c)

Run the predictions

After training the model, we use it to make predictions based on the unseen test data

#run the predictions on the training and testing data
y_rf_pred_test = rf_model.predict(X_test)
y_rf_clean_pred_test = rf_model_clean.predict(X_test_c)

#compare the actual values (ie, target) with the values predicted by the model
rf_pred_test_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_rf_pred_test})

rf_pred_test_df

	Actual	Predicted
769037	47000.0	46960.0
615577	463040.0	462650.0
635236	137000.0	137246.0
901781	235000.0	235000.0
630118	305000.0	305250.0
...	...	...
679168	378000.0	377490.0
965471	179900.0	179780.0
459234	295000.0	295100.0
876950	13000.0	13117.6
652908	295595.0	295595.0

122954 rows × 2 columns

#compare the actual values (ie, target) with the values predicted by the model
rf_pred_clean_test_df = pd.DataFrame({'Actual': y_test_c, 'Predicted': y_rf_clean_pred_test})

rf_pred_clean_test_df

	Actual	Predicted
915231	195000.0	194800.0
51414	198000.0	198540.0
520001	410000.0	409900.0
18080	230000.0	230500.0
456006	260000.0	260100.0
...	...	...
673908	160000.0	160000.0
767435	379900.0	380383.2
2850	2450000.0	2463313.0
954634	250000.0	250000.0
849689	140500.0	140394.4

122899 rows × 2 columns

Evaluate the models

Now let's compare what the model produced and the actual results
We'll use R^2 method to evaluate the model's performance. This method is primarily used to evaluate the performance of regressor models.

# Determine accuracy uisng 𝑅^2
from sklearn.metrics import r2_score

score = r2_score(y_test, y_rf_pred_test)
score_clean = r2_score(y_test_c, y_rf_clean_pred_test)

print("R^2 test Raw - {}%".format(round(score, 2) *100))
print("R^2 test Clean - {}%".format(round(score_clean, 2) *100))

R^2 test Raw - 78.0%
R^2 test Clean - 95.0%

From this we can see a clear difference in performance between the cleaned data and the raw data that includes the outliers. This is a basic way to solve the problem and in another tutorial we'll compare different methods of eliminating outliers to reduce overfitting.

Feature Importance

Its important to findout which of the feature variables we passed was the most important in determining the target variable which is the sale amount in our case. Random Forest Regressor provides a method to determine this. Lets plot the 5 most important features


plt.figure(figsize=(10,6))
feat_importances = pd.Series(rf_model.feature_importances_, index = X_train.columns)
feat_importances.nlargest(5).plot(kind='barh', title='Feature Importance in Raw Data')


plt.figure(figsize=(10,6))
feat_importances = pd.Series(rf_model_clean.feature_importances_, index = X_train_c.columns)
feat_importances.nlargest(5).plot(kind='barh', title='Feature Importance in Clean Data')

Interestingly, according to the raw data, a property being in Willington town is a higher predictor of the score than the assessed value. This indicates that the data is being affected by outliers.

In the cleaned data, we can clearly see that the assessed value is the most important feature, which is what we would expect from the data.

Cross validation

Hyper parameters can be used to tweak the performance of the model. Cross validation can be used to evaluate the performance of the model with different hyperparameters.

Consider the n_estimators parameter which represents the number of decision tress in the forest. Lets see whether increasing it can increase model performance

We'll use Kfold cross validation to see what effect changing the n_estimators has on the MSE (Mean Square Error) and R^2 score of the model. This should help us determine the optimal value

Here's how KFold cross-validation works:

Splitting the Data: The dataset is divided into k roughly equal-sized folds, where "k" is the number of desired folds specified as an argument.
Training and Validation: The model is trained and validated k times, with each of the k folds serving as the validation set exactly once. The remaining k-1 folds collectively form the training set.
Metrics: For each iteration (or "fold"), the model's performance is assessed using a chosen evaluation metric (e.g., mean squared error, accuracy, etc.).
Average and Variance: After completing k iterations, the evaluation metric scores are typically averaged to obtain a single performance estimate for the model. This average score provides an estimate of the model's performance and generalization ability.

Key benefits of KFold cross-validation include:

It provides a more reliable estimate of a model's performance by using multiple validation sets.
It helps detect overfitting because the model is evaluated on different data subsets.
It allows for a more thorough assessment of model performance, especially with limited data.

from sklearn.model_selection import KFold, cross_val_score
# Define a range of n_estimators values to test
n_estimators_values = [10, 20]

# Set up cross-validation (e.g., 5-fold cross-validation)
kf = KFold(n_splits=3, shuffle=True, random_state=42)

results = {'n_estimators': [], 'MSE': [], 'R-squared': []}

for n_estimators in n_estimators_values:
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)

    # Perform cross-validation and calculate MSE and R-squared
    mse_scores = -cross_val_score(model, X_cleaned, y_cleaned, cv=kf, scoring='neg_mean_squared_error')
    r2_scores = cross_val_score(model, X_cleaned, y_cleaned, cv=kf, scoring='r2')

    results['n_estimators'].append(n_estimators)
    results['MSE'].append(mse_scores.mean())
    results['R-squared'].append(r2_scores.mean())

results_df = pd.DataFrame(results)
print(results_df)

   n_estimators           MSE  R-squared
0            10  3.120811e+10   0.959579
1            20  2.970360e+10   0.961509

As you can see, there isn't much to gain by increasing the n_estimators but the runtime of the model increases significantly.

So what if we use a different regression model other than Random Forest? say the extreme Gradient Boosting regressor. Is there more accuracy to gain? Stay tuned.

Forem