DEV Community: Ricky

Using machine learning to predict the selling price of a property

Ricky — Fri, 20 Oct 2023 10:42:22 +0000

Regression

Regression is a fundamental machine learning algorithm used for modeling the relationship between a dependent variable (target) and one or more independent variables (features).

Let's build a regression machine learning model and use it to predict the selling price of a property using sales data from Connecticut.

We-ll use scikit-learn’s RandomForestRegressor to train the model and evaluate it on a test set

The overall objective is to predict the sale amount of property using 5 feature variables and 1 target variable:

See this repo for the Jupyter notebook

Feature Variables

List Year: The year the property was listed
Town: The Town in which the property is
Assessed Value: Reported assessed value
Sales Ratio: The ratio of the sale value to the assessed value
Property Type: The type of property

Target Variable

Sale Amount: The reported sale amount.

Dependencies

pandas - To work with solid data-structures, n-dimensional matrices and perform exploratory data analysis.
matplotlib - To visualize data using 2D plots.
scikit-learn - To create machine learning models easily and make predictions.
numpy - To work with arrays.

Install dependencies, obtain & load data

%pip install -U pandas matplotlib numpy scikit-learn

import pandas as pd
import seaborn as sns 

import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np

# Read in the data with read_csv() into a Pandas Dataframe
data_path = '/path_to_csv_file/data.csv'
sales_df = pd.read_csv(data_path)

# Use .info() to show the features (i.e. columns) in the dataset along with a count and datatype
sales_df.info()

/var/folders/21/_103sf7x3zgg4yk3syyg448h0000gn/T/ipykernel_60538/748677588.py:3: DtypeWarning: Columns (8,9,10,11,12) have mixed types. Specify dtype option on import or set low_memory=False.
  sales_df = pd.read_csv(data_path)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 997213 entries, 0 to 997212
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Serial Number     997213 non-null  int64  
 1   List Year         997213 non-null  int64  
 2   Date Recorded     997211 non-null  object 
 3   Town              997213 non-null  object 
 4   Address           997162 non-null  object 
 5   Assessed Value    997213 non-null  float64
 6   Sale Amount       997213 non-null  float64
 7   Sales Ratio       997213 non-null  float64
 8   Property Type     614767 non-null  object 
 9   Residential Type  608904 non-null  object 
 10  Non Use Code      289681 non-null  object 
 11  Assessor Remarks  149864 non-null  object 
 12  OPM remarks       9934 non-null    object 
 13  Location          197697 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 106.5+ MB

Data wrangling

The raw data has more data points than we're interested in and the data has some missing values.

We need to clean up the data to a form the model can understand

# Get the counts of the null values in each column
sales_df.isnull().sum()

Serial Number            0
List Year                0
Date Recorded            2
Town                     0
Address                 51
Assessed Value           0
Sale Amount              0
Sales Ratio              0
Property Type       382446
Residential Type    388309
Non Use Code        707532
Assessor Remarks    847349
OPM remarks         987279
Location            799516
dtype: int64

We're going to begin by dropping all the columns we don't need for this problem.
This includes: ['Serial Number','Date Recorded', 'Address', 'Residential Type', 'Non Use Code', 'Assessor Remarks', 'OPM remarks', 'Location']
Moreover, a lot of these fields have null values.
We'll keep 'Property Type' but we'll drop all null values and later on we'll see how to handle such data

# drop the unnecessary columns
columns_to_drop = ['Serial Number','Date Recorded', 'Address', 'Residential Type', 'Non Use Code', 'Assessor Remarks', 'OPM remarks', 'Location']

sales_df = sales_df.drop(columns=columns_to_drop)
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 997213 entries, 0 to 997212
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   List Year       997213 non-null  int64  
 1   Town            997213 non-null  object 
 2   Assessed Value  997213 non-null  float64
 3   Sale Amount     997213 non-null  float64
 4   Sales Ratio     997213 non-null  float64
 5   Property Type   614767 non-null  object 
dtypes: float64(3), int64(1), object(2)
memory usage: 45.6+ MB

# remove all rows with null Property Type
sales_df = sales_df.dropna(subset=['Property Type'])
sales_df.isnull().sum()

List Year         0
Town              0
Assessed Value    0
Sale Amount       0
Sales Ratio       0
Property Type     0
dtype: int64

Great! now all rows and all colums have data. lets see the shape of the data

# print shape of dataframe
sales_df.shape

(614767, 6)

# as a python convention, lets rename the column names to be in snake_case
sales_df = sales_df.rename(columns=lambda x: x.lower().replace(' ', '_'))
sales_df.columns

Index(['list_year', 'town', 'assessed_value', 'sale_amount', 'sales_ratio',
       'property_type'],
      dtype='object')

Visualizations

Histograms

# let's plot a histogram for the all the features to understand the data distributions
# from the graph we can clearly see there are outliers in the data
# this can skew the accuracy of the model
sales_df.plot(figsize=(20,15))

Clean data to remove outliers

We can set a few conditions on the data in order to remove outliers

Filter sale amount and assessed value to values less than 100M
Filter sale amount to be greater than 2000

We'll then compare the performance of two models trained using the cleaned and the raw data with outliers

sales_df_clean = sales_df[(sales_df['sale_amount'] < 1e8) & (sales_df['assessed_value'] < 1e8) & (sales_df['sale_amount'] > 2000)]

# plot the data again to see the difference
sales_df_clean.plot(figsize=(20,15))

(614767, 6)

Encode categorical data

We nead to encode the categorical data (town and property type) into numerical data using one-hot encoding


sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 614767 entries, 0 to 997211
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   list_year       614767 non-null  int64  
 1   town            614767 non-null  object 
 2   assessed_value  614767 non-null  float64
 3   sale_amount     614767 non-null  float64
 4   sales_ratio     614767 non-null  float64
 5   property_type   614767 non-null  object 
dtypes: float64(3), int64(1), object(2)
memory usage: 32.8+ MB

# let's count the number of unique categories for town and property type
sales_df["property_type"].value_counts()

property_type
Single Family     401612
Condo             105420
Residential        60728
Two Family         26408
Three Family       12586
Vacant Land         3163
Four Family         2150
Commercial          1981
Apartments           486
Industrial           228
Public Utility         5
Name: count, dtype: int64

# town
sales_df["town"].value_counts()

town
Bridgeport       18264
Waterbury        17938
Stamford         17374
Norwalk          14637
Greenwich        12329
                 ...  
Eastford           266
Scotland           258
Canaan             250
Union              121
***Unknown***        1
Name: count, Length: 170, dtype: int64

One-Hot Encoding

One-hot encoding is a technique used in machine learning and data preprocessing to convert categorical data into a binary format that can be fed into machine learning algorithms. Each category or label in a categorical variable is represented as a binary vector, where only one element in the vector is "hot" (set to 1) to indicate the category, while all others are "cold" (set to 0). This allows machine learning algorithms to work with categorical data, which typically requires numerical inputs. One-hot encoding is essential when working with categorical variables in tasks like classification and regression, as it ensures that the model can interpret and make predictions based on these variables.

# let's replace the town and property columns using pandas' get_dummies()
sales_df_encoded = pd.get_dummies(data=sales_df, columns=['property_type', 'town'])
sales_df_clean_encoded = pd.get_dummies(data=sales_df_clean, columns=['property_type', 'town'])

	list_year	assessed_value	sale_amount	sales_ratio	property_type_Apartments	property_type_Commercial	property_type_Condo	property_type_Four Family	property_type_Industrial	property_type_Public Utility	...	town_Willington	town_Wilton	town_Winchester	town_Windham	town_Windsor	town_Windsor Locks	town_Wolcott	town_Woodbridge	town_Woodbury	town_Woodstock
0	2020	150500.0	325000.0	0.4630	False	True	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
1	2020	253000.0	430000.0	0.5883	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
2	2020	130400.0	179900.0	0.7248	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
3	2020	619290.0	890000.0	0.6958	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
4	2020	862330.0	1447500.0	0.5957	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False

5 rows × 185 columns

Train the models

import sklearn
from sklearn.model_selection import train_test_split

# Split target variable and feature variables
y = sales_df_encoded['sale_amount']
X = sales_df_encoded.drop(columns=['sale_amount'])
y_cleaned = sales_df_clean_encoded['sale_amount']
X_cleaned = sales_df_clean_encoded.drop(columns=['sale_amount'])

Split training & test data

We need to split our data into training data and test data in an 80:20 or 70:30 ratio depending on the use case.

The reasoning is that we'll use the training data to train the model and the test data to validate it. The model needs to perform well with test data that it has never seen

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, shuffle=True, test_size=0.2)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_cleaned, y_cleaned, random_state=42, shuffle=True, test_size=0.2)

# print the shapes of the new X objects
print(X_train.shape)
print(X_train_c.shape)

(491813, 184)
(491595, 184)

RandomForestRegressor

In scikit-learn, the RandomForestRegressor is a machine learning model that is used for regression tasks. It is an ensemble learning method that combines the predictions of multiple decision trees to make more accurate and robust predictions. When creating a RandomForestRegressor, you can specify several parameters to configure its behavior. Here are some of the common parameters passed to the RandomForestRegressor:

n_estimators: The number of decision trees in the forest. Increasing the number of trees can improve the model's performance, but it also increases computation time.
max_depth: The maximum depth of each decision tree in the forest. It limits the tree's growth and helps prevent overfitting.
random_state: A random seed for reproducibility. If you set this value, you'll get the same results each time you run the model with the same data.

These are some of the commonly used parameters, but the RandomForestRegressor has many more options for fine-tuning your model. You can choose the appropriate values for these parameters based on your specific regression task and dataset.

Random Forest is a non-parametric model, meaning it does not make strong assumptions about the underlying data distribution. It is highly flexible and can capture complex relationships in the data.

Overfitting

This is the tendency of a model to be very sensitive to outliers. Such a model performs very well on training data but poorly on unseen data

Underfitting

Underfitting occurs when a model is too simple or lacks the capacity to capture the underlying patterns in the data. Such a model generalizes too much and results in inaccurate results.

from sklearn.ensemble import RandomForestRegressor

# Create a  regressor using all the feature variables
# we will tweak the parameters later to improve the performance of the model
rf_model = RandomForestRegressor(n_estimators=10,random_state=10)

rf_model_clean = RandomForestRegressor(n_estimators=10,random_state=10)
# Train the models using the training sets
rf_model.fit(X_train, y_train)
rf_model_clean.fit(X_train_c, y_train_c)

Run the predictions

After training the model, we use it to make predictions based on the unseen test data

#run the predictions on the training and testing data
y_rf_pred_test = rf_model.predict(X_test)
y_rf_clean_pred_test = rf_model_clean.predict(X_test_c)

#compare the actual values (ie, target) with the values predicted by the model
rf_pred_test_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_rf_pred_test})

rf_pred_test_df

	Actual	Predicted
769037	47000.0	46960.0
615577	463040.0	462650.0
635236	137000.0	137246.0
901781	235000.0	235000.0
630118	305000.0	305250.0
...	...	...
679168	378000.0	377490.0
965471	179900.0	179780.0
459234	295000.0	295100.0
876950	13000.0	13117.6
652908	295595.0	295595.0

122954 rows × 2 columns

#compare the actual values (ie, target) with the values predicted by the model
rf_pred_clean_test_df = pd.DataFrame({'Actual': y_test_c, 'Predicted': y_rf_clean_pred_test})

rf_pred_clean_test_df

	Actual	Predicted
915231	195000.0	194800.0
51414	198000.0	198540.0
520001	410000.0	409900.0
18080	230000.0	230500.0
456006	260000.0	260100.0
...	...	...
673908	160000.0	160000.0
767435	379900.0	380383.2
2850	2450000.0	2463313.0
954634	250000.0	250000.0
849689	140500.0	140394.4

122899 rows × 2 columns

Evaluate the models

Now let's compare what the model produced and the actual results
We'll use R^2 method to evaluate the model's performance. This method is primarily used to evaluate the performance of regressor models.

# Determine accuracy uisng 𝑅^2
from sklearn.metrics import r2_score

score = r2_score(y_test, y_rf_pred_test)
score_clean = r2_score(y_test_c, y_rf_clean_pred_test)

print("R^2 test Raw - {}%".format(round(score, 2) *100))
print("R^2 test Clean - {}%".format(round(score_clean, 2) *100))

R^2 test Raw - 78.0%
R^2 test Clean - 95.0%

From this we can see a clear difference in performance between the cleaned data and the raw data that includes the outliers. This is a basic way to solve the problem and in another tutorial we'll compare different methods of eliminating outliers to reduce overfitting.

Feature Importance

Its important to findout which of the feature variables we passed was the most important in determining the target variable which is the sale amount in our case. Random Forest Regressor provides a method to determine this. Lets plot the 5 most important features


plt.figure(figsize=(10,6))
feat_importances = pd.Series(rf_model.feature_importances_, index = X_train.columns)
feat_importances.nlargest(5).plot(kind='barh', title='Feature Importance in Raw Data')


plt.figure(figsize=(10,6))
feat_importances = pd.Series(rf_model_clean.feature_importances_, index = X_train_c.columns)
feat_importances.nlargest(5).plot(kind='barh', title='Feature Importance in Clean Data')

Interestingly, according to the raw data, a property being in Willington town is a higher predictor of the score than the assessed value. This indicates that the data is being affected by outliers.

In the cleaned data, we can clearly see that the assessed value is the most important feature, which is what we would expect from the data.

Cross validation

Hyper parameters can be used to tweak the performance of the model. Cross validation can be used to evaluate the performance of the model with different hyperparameters.

Consider the n_estimators parameter which represents the number of decision tress in the forest. Lets see whether increasing it can increase model performance

We'll use Kfold cross validation to see what effect changing the n_estimators has on the MSE (Mean Square Error) and R^2 score of the model. This should help us determine the optimal value

Here's how KFold cross-validation works:

Splitting the Data: The dataset is divided into k roughly equal-sized folds, where "k" is the number of desired folds specified as an argument.
Training and Validation: The model is trained and validated k times, with each of the k folds serving as the validation set exactly once. The remaining k-1 folds collectively form the training set.
Metrics: For each iteration (or "fold"), the model's performance is assessed using a chosen evaluation metric (e.g., mean squared error, accuracy, etc.).
Average and Variance: After completing k iterations, the evaluation metric scores are typically averaged to obtain a single performance estimate for the model. This average score provides an estimate of the model's performance and generalization ability.

Key benefits of KFold cross-validation include:

It provides a more reliable estimate of a model's performance by using multiple validation sets.
It helps detect overfitting because the model is evaluated on different data subsets.
It allows for a more thorough assessment of model performance, especially with limited data.

from sklearn.model_selection import KFold, cross_val_score
# Define a range of n_estimators values to test
n_estimators_values = [10, 20]

# Set up cross-validation (e.g., 5-fold cross-validation)
kf = KFold(n_splits=3, shuffle=True, random_state=42)

results = {'n_estimators': [], 'MSE': [], 'R-squared': []}

for n_estimators in n_estimators_values:
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)

    # Perform cross-validation and calculate MSE and R-squared
    mse_scores = -cross_val_score(model, X_cleaned, y_cleaned, cv=kf, scoring='neg_mean_squared_error')
    r2_scores = cross_val_score(model, X_cleaned, y_cleaned, cv=kf, scoring='r2')

    results['n_estimators'].append(n_estimators)
    results['MSE'].append(mse_scores.mean())
    results['R-squared'].append(r2_scores.mean())

results_df = pd.DataFrame(results)
print(results_df)

   n_estimators           MSE  R-squared
0            10  3.120811e+10   0.959579
1            20  2.970360e+10   0.961509

As you can see, there isn't much to gain by increasing the n_estimators but the runtime of the model increases significantly.

So what if we use a different regression model other than Random Forest? say the extreme Gradient Boosting regressor. Is there more accuracy to gain? Stay tuned.

Training a simple neural network

Ricky — Wed, 18 Oct 2023 19:26:23 +0000

A simple problem for training and testing a neural network is classifying the famous Iris dataset. The Iris dataset consists of 150 samples from three species of Iris flowers (Iris setosa, Iris virginica, and Iris versicolor). Each sample has four features: sepal length, sepal width, petal length, and petal width. The goal is to train a neural network to classify the Iris species based on these features.

Here's a step-by-step example using Python and popular libraries like NumPy and scikit-learn:

a. Load the Iris dataset:

You can load the Iris dataset from scikit-learn, a popular machine learning library in Python.

   from sklearn import datasets

   iris = datasets.load_iris()
   X = iris.data  # Features
   y = iris.target  # Target labels

b. Data Preprocessing:

You may need to preprocess the data, such as scaling features, splitting the dataset into training and testing sets, and one-hot encoding the target labels (if using neural networks).

c. Build and Train the Neural Network:

You can use deep learning libraries like TensorFlow or PyTorch to build and train your neural network. Here's an example using TensorFlow and Keras:

   import tensorflow as tf
   from tensorflow import keras

   model = keras.Sequential([
       keras.layers.Dense(8, activation='relu', input_shape=(4,)),
       keras.layers.Dense(3, activation='softmax')
   ])

   model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
   model.fit(X_train, y_train, epochs=100)

d. Evaluate the Model:

After training the model, you can evaluate its performance on the test set:

   test_loss, test_accuracy = model.evaluate(X_test, y_test)
   print(f'Test accuracy: {test_accuracy}')

This example demonstrates a simple neural network for classifying the Iris dataset, which is a classic introductory problem for machine learning. You can modify the neural network architecture, hyperparameters, or try different datasets as you become more comfortable with neural networks.

Webscraping using Playwright

Ricky — Tue, 10 Oct 2023 08:55:56 +0000

According to their twitter handle:

Playwright is an automation library for cross-browser end-to-end testing by Microsoft.

Think Selenium, Cypress, Puppeteer.

Primarily, Playwright is used for automating the testing of web applications but its uses extend beyond this. These are among the top features that playwright offers:

Cross-browser. Playwright supports all modern rendering engines including Chromium (Chrome, Edge, Opera), WebKit (Safari), and Gecko (Firefox).

Cross-platform. Test on Windows, Linux, and macOS, locally or on CI, headless or headed.

Cross-language. Use the Playwright API in TypeScript, JavaScript, Python, .NET, Java.

Test Mobile Web. Native mobile emulation of Google Chrome for Android and Mobile Safari. The same rendering engine works on your Desktop and in the Cloud.

In this article, we're going to explore these features by using Playwright as a web scraper to get some data from GitHub. Web-scraping is a method to obtain data from websites using programmable tools. There are many use-cases for web scraping, including Data Mining, Weather data monitoring etc. Please note, however, that many websites do frown upon it.

What we'll build
We're going to open a user's profile on Github and scrape it to get the repository names and languages of the pinned repositories.

Find the code for this tutorial here:
https://github.com/rickynyairo/playwright-webscraping

Prerequisites
Playwright supports a variety of programming languages, but we'll use Typescript for this tutorial. Among the benefits here is that, it comes with Typescript support out of the box. No setup and config steps required.
We'll use NodeJS LTS (v18.18.0)
To install NodeJS, depending on your platform, use Node Version Manager.



$ nvm install --lts

Proceed as follows once NodeJS is installed:
Install playwright using npm or yarn depending on your preferred package manager. Navigate to the folder in which you want to create the tests and run:



$ npm init playwright@latest

or using yarn:



$ yarn create playwright

For this article, we'll proceed assuming you're using npm.
Run the install command and select the following to get started:

Choose between TypeScript or JavaScript (default is TypeScript)
Name of your Tests folder (default is tests or e2e if you already have a tests folder in your project)
Add a GitHub Actions workflow to easily run tests on CI
Install Playwright browsers (default is true)

You should see this after a successful run:

Running this command will download the browsers and create a few files in your folder



playwright.config.ts
package.json
package-lock.json
tests/
  example.spec.ts
tests-examples/
  demo-todo-app.spec.ts

The playwright.config file defines the configuration for running playwright such as which browsers to use. We only need to run this on one browser. Edit projects property on the config object in this file and remove the other browsers to be left with:

playwright.config.ts



 /* Configure projects for major browsers */
  projects: [
    {
      name: 'chromium',
      use: { ...devices['Desktop Chrome'] },
    }
  ]

We're only interested in the tests/ folder which contains an example test. You can have a look at the test-examples/ folder for more detailed test examples.

Run this to verify correct initialisation:



$ npx playwright test

The tests should run and pass.

Next we'll navigate to a GitHub profile, find the pinned repositories and their languages. To do this we'll use a page and multiple locators.

Locator: A locator represents a way to find elements on the page at any moment.
Page object model: a page object represents a part of your web application. In this case, we'll have a profile page with locators for the profile avatar and pinned repositories.
Page Objects are a great way to create reusable parts of your web app that can be imported and reused for different tests without repeating all the logic for locators.

Let's create a Page Object for the profile picture that receives a username and opens the profile page.
Create a new file and call it ProfilePage.ts adding the following code there:

ProfilePage.ts



import { type Locator, type Page } from "@playwright/test";

export class GithubProfilePage {
  readonly page: Page;
  readonly pinnedRepositories: Locator;
  readonly username: string;

  constructor(page: Page, username: string) {
    this.username = username;
    this.page = page;
    this.pinnedRepositories = page.locator(".pinned-item-list-item-content");
  }

  async goto() {
    await this.page.goto(`https://github.com/${this.username}`);
  }

  async getPinnedRepositories() {
    return await this.pinnedRepositories.all();
  }

  async getRepoName(repo: Locator) {
    return await repo.locator(".text-bold").first().innerText();
  }

  async getRepoDescription(repo: Locator) {
    return await repo.locator(".pinned-item-desc").first().innerText();
  }

  async getRepoLanguage(repo: Locator) {
    return await repo
      .locator("[itemprop='programmingLanguage']")
      .first()
      .innerText();
  }
}

Here, we're initialising a profile page object using a username and defining a few functions to find what we want in the page including the repo names and language in each repo.
This page object is fully reusable across tests.

Rename the example.spec.ts file to github.spec.ts and replace the contents with the following:

github.spec.ts



import { test, expect } from '@playwright/test';
import { GithubProfilePage } from './ProfilePage';

test('can get pinned repositories and their languages', async ({ page }) => {
  const profilePage = new GithubProfilePage(page, 'rickynyairo');
  await profilePage.goto();
  const pinnedRepositories = await profilePage.getPinnedRepositories();

  // get repo names and languages
  const repoAndLanguage = await Promise.all(pinnedRepositories.map(async (repo) => {
    const title = await profilePage.getRepoName(repo);
    const description = await profilePage.getRepoDescription(repo);
    const language = await profilePage.getRepoLanguage(repo);
    return { title, description, language };
  }));

  // log the language and title as a table
  console.table(repoAndLanguage, ['title', 'description', 'language']);

  expect(pinnedRepositories.length).toBeGreaterThan(0);
});

We are then using the profile page to initialise opening the browser and navigating to the profile page. From here we can find the pinned repositories and their languages, and log them on the console.

Once done, run the following command:


 playwright test

This should result in the following output:

And there it is! you now have a simple web scraper. Explore using Playwright to do more automation on your own. You can, for example, use it to create an issue in a repository. To do this, you will have to go through a login process. This should make for a fun learning project.

Thanks for making it this far, please add a comment for any corrections you think I should make or just say hi :)

Special credit to:
https://medium.com/the-andela-way/introduction-to-web-scraping-using-selenium-7ec377a8cf72