<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tamal Barman</title>
    <description>The latest articles on DEV Community by Tamal Barman (@tamalbarman).</description>
    <link>https://dev.to/tamalbarman</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1077670%2F2504ab3a-2148-4cad-8c08-ee1ddccc1bb1.png</url>
      <title>DEV Community: Tamal Barman</title>
      <link>https://dev.to/tamalbarman</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tamalbarman"/>
    <language>en</language>
    <item>
      <title>Navigating the Housing Market Storm: A Data-Driven Approach</title>
      <dc:creator>Tamal Barman</dc:creator>
      <pubDate>Tue, 05 Mar 2024 10:16:08 +0000</pubDate>
      <link>https://dev.to/tamalbarman/navigating-the-housing-market-storm-a-data-driven-approach-3ab6</link>
      <guid>https://dev.to/tamalbarman/navigating-the-housing-market-storm-a-data-driven-approach-3ab6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the vast landscape of the real estate market, understanding the dynamics that influence housing prices is crucial. In this blog post, we embark on a data-driven journey to explore the intricacies of housing price prediction, using advanced regression techniques and ensemble learning. The dataset under scrutiny is the well-known "House Prices: Advanced Regression Techniques" dataset from Kaggle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explanation of Random Forest
&lt;/h2&gt;

&lt;p&gt;Random forests are powerful predictive models that allow for data-driven exploration of many explanatory variables in predicting a response or target variable. They provide importance scores for each explanatory variable and let us evaluate how predictive accuracy changes as the number of trees grows.&lt;/p&gt;
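&lt;p&gt;As a minimal sketch of this idea (the data below is synthetic and purely illustrative, not from the Kaggle dataset), a fitted random forest exposes its importance scores directly:&lt;/p&gt;

```python
# Minimal sketch: fit a random forest and read off feature importance scores.
# The data is synthetic; the response depends mostly on the first feature.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                            # three explanatory variables
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)  # response driven by the first two

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)  # one score per explanatory variable, summing to 1
```

&lt;p&gt;The scores sum to 1, and the first feature, which dominates the response, receives by far the largest share.&lt;/p&gt;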

&lt;h2&gt;
  
  
  Data Preprocessing
&lt;/h2&gt;

&lt;p&gt;Before diving into the Random Forest analysis, it's essential to preprocess the data. This includes handling missing values, encoding categorical variables, and scaling features to ensure the model performs optimally.&lt;/p&gt;
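&lt;p&gt;A hedged sketch of those three steps on a toy frame (only &lt;code&gt;LotArea&lt;/code&gt; and &lt;code&gt;Street&lt;/code&gt; are used here, purely for illustration):&lt;/p&gt;

```python
# Sketch of the three preprocessing steps on a toy two-column frame.
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({"LotArea": [8450.0, np.nan, 11250.0, 9550.0],
                   "Street": ["Pave", "Grvl", "Pave", "Pave"]})

df["LotArea"] = df["LotArea"].fillna(df["LotArea"].median())  # 1. impute missing values
df = pd.get_dummies(df)                                       # 2. one-hot encode categoricals
num_cols = df.select_dtypes(np.number).columns
df[num_cols] = RobustScaler().fit_transform(df[num_cols])     # 3. scale numeric features
print(df)
```

&lt;p&gt;After these steps the frame has no missing values, the categorical column has become indicator columns, and the numeric column is centred on its median.&lt;/p&gt;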

&lt;h2&gt;
  
  
  Importing Modules and Data
&lt;/h2&gt;

&lt;p&gt;We kick off by importing essential libraries such as NumPy, Pandas, and Scikit-Learn. The dataset, split into training and testing sets, is loaded into our analysis environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Importing Modules
import sklearn
import scipy
# ... (other module imports)

# Importing training and testing data
train_data = pd.read_csv("/content/train.csv", index_col="Id")
test_data = pd.read_csv("/content/test.csv", index_col="Id")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Explanation of Response and Explanatory Variables
&lt;/h2&gt;

&lt;p&gt;In our analysis, the response variable (dependent variable) is the sale price of the houses, while the explanatory variables (independent variables) include various features such as the size of the house, number of bedrooms, location, etc. These variables were chosen based on their relevance to predicting housing prices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Visualization
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scatter Plot to Check Raw Outliers&lt;/strong&gt;&lt;br&gt;
Our exploration begins with a scatter plot visualizing the relationship between the ground living area and sale prices. This aids in identifying potential outliers, setting the stage for data cleansing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fig, ax = plt.subplots(figsize=(10, 6))
ax.grid()
ax.scatter(train_data["GrLivArea"], train_data["SalePrice"], c="#3f72af", zorder=3, alpha=0.9)
ax.axvline(4500, c="#112d4e", ls="--", zorder=2)
ax.set_xlabel("Ground living area (sq. ft)", labelpad=10)
ax.set_ylabel("Sale price ($)", labelpad=10)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data Cleaning
&lt;/h2&gt;

&lt;p&gt;Removing outliers is a pivotal step in refining the dataset. In this case, we exclude instances where the ground living area exceeds 4450 sq. ft.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train_data = train_data[train_data["GrLivArea"] &amp;lt; 4450]
data = pd.concat([train_data.drop("SalePrice", axis=1), test_data])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bar Graph to Check Missing Values
&lt;/h2&gt;

&lt;p&gt;Understanding the prevalence of missing values guides our imputation strategy. A bar graph illustrates the number of missing values for each feature.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nans = data.isna().sum().sort_values(ascending=False)
nans = nans[nans &amp;gt; 0]
fig, ax = plt.subplots(figsize=(10, 6))
ax.grid()
ax.bar(nans.index, nans.values, zorder=2, color="#3f72af")
ax.set_ylabel("No. of missing values", labelpad=10)
ax.set_xlim(-0.6, len(nans) - 0.4)
ax.xaxis.set_tick_params(rotation=90)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Exploring Numerical Variables
&lt;/h2&gt;

&lt;p&gt;We delve into the analysis of numerical features, distinguishing between discrete and continuous variables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discrete Values
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Continuous Values
continuous_variables = []
for feature in numerical_features:
    if feature not in discrete_variables and feature not in ["YearBuilt", "YearRemodAdd", "GarageYrBlt", "YrSold"]:
        continuous_variables.append(feature)

print(continuous_variables)

for feature in continuous_variables:
    train_data[feature].hist(bins=30)
    plt.title(feature)
    plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Continuous Values
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ... (code for identifying and visualizing continuous variables)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Categorical Variables
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Categorical Values
categorical_features = []
for feature in train_data.columns:
    if train_data[feature].dtype == 'O' and feature != 'SalePrice':
        categorical_features.append(feature)
print(categorical_features)

for feature in categorical_features:
    train_data.groupby(feature)['SalePrice'].mean().plot.bar()
    plt.title(feature + ' vs Sale Price')
    plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data Transformation &amp;amp; Feature Scaling
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Data Transformation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Data Transformation
data[["MSSubClass", "YrSold"]] = data[["MSSubClass", "YrSold"]].astype("category")  # converting into categorical value
data["MoSoldsin"] = np.sin(2 * np.pi * data["MoSold"] / 12)  # Sine Function
data["MoSoldcos"] = np.cos(2 * np.pi * data["MoSold"] / 12)  # Cosine Function
data = data.drop("MoSold", axis=1)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Feature Scaling
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Feature Scaling
cols = data.select_dtypes(np.number).columns
data[cols] = RobustScaler().fit_transform(data[cols])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
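&lt;p&gt;RobustScaler centres each feature on its median and scales by the interquartile range, so the extreme values that survive in housing data distort the scaling far less than with StandardScaler. A tiny illustration on made-up values:&lt;/p&gt;

```python
# RobustScaler (median/IQR) vs StandardScaler (mean/std) with one extreme outlier.
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one extreme outlier
robust = RobustScaler().fit_transform(x)
standard = StandardScaler().fit_transform(x)
print(robust.ravel())    # median value maps exactly to 0
print(standard.ravel())  # mean and std are dragged toward the outlier
```

&lt;p&gt;With RobustScaler the median point sits at exactly 0 and the typical values keep a sensible spread; with StandardScaler the outlier inflates the standard deviation and squashes everything else.&lt;/p&gt;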



&lt;h2&gt;
  
  
  Encoding
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.get_dummies(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Feature Recovery &amp;amp; Removing Outliers
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_train = data.loc[train_data.index]
X_test = data.loc[test_data.index]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Optimization, Training, and Testing
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Hyperparameter Optimization
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hyper Parameter Optimization
kf = KFold(n_splits=5, random_state=0, shuffle=True)
rmse = lambda y, y_pred: np.sqrt(mean_squared_error(y, y_pred))
scorer = make_scorer(rmse, greater_is_better=False)

# We use Randomized Search for optimization, since it is more efficient than an exhaustive grid search.
# Define a function which takes a model and a parameter grid as inputs, runs Randomized Search, and returns the fitted search object.
def random_search(model, grid, n_iter=100):
    search = RandomizedSearchCV(estimator=model, param_distributions=grid, cv=kf, n_iter=n_iter, n_jobs=4, random_state=0, verbose=True)
    if isinstance(model, xgb.XGBRegressor):
        # XGBoost supports early stopping through its fit parameters
        return search.fit(X_train, y, early_stopping_rounds=5, verbose=True)
    return search.fit(X_train, y)

# Hyperparameter Grids
xgb_hpg = {'n_estimators': [100, 400, 800], 'max_depth': [3, 6, 9], 'learning_rate': [0.05, 0.1, 0.20], 'min_child_weight': [1, 10, 100]}  # XGBoost
ridge_hpg = {"alpha": np.logspace(-1, 2, 500)}  # Ridge Regressor
lasso_hpg = {"alpha": np.logspace(-5, -1, 500)}  # Lasso Regressor
svr_hpg = {"C": np.arange(1, 100), "gamma": np.linspace(0.00001, 0.001, 50), "epsilon": np.linspace(0.01, 0.1, 50)}  # Support Vector Regressor
lgbm_hpg = {"colsample_bytree": np.linspace(0.2, 0.7, 6), "learning_rate": np.logspace(-3, -1, 100)}  # LGBM
gbm_hpg = {"max_features": np.linspace(0.2, 0.7, 6), "learning_rate": np.logspace(-3, -1, 100)}  # Gradient Boost
cat_hpg = {'depth': [2, 9], 'iterations': [10, 30], 'learning_rate': [0.001, 0.1]}

# Randomized Search for each model
xgb_search = random_search(xgb.XGBRegressor(n_estimators=1000, n_jobs=4), xgb_hpg)  # XGBoost
ridge_search = random_search(Ridge(), ridge_hpg)  # Ridge Regressor
lasso_search = random_search(Lasso(), lasso_hpg)  # Lasso Regressor
svr_search = random_search(SVR(), svr_hpg, n_iter=100)  # Support Vector Regressor
lgbm_search = random_search(LGBMRegressor(n_estimators=2000, max_depth=3), lgbm_hpg, n_iter=100)  # LGBM
gbm_search = random_search(GradientBoostingRegressor(n_estimators=2000, max_depth=3), gbm_hpg, n_iter=100)  # Gradient Boost

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Random Forest Analysis
&lt;/h2&gt;

&lt;p&gt;Now, let's run a Random Forest analysis to predict housing prices. We'll split the data into training and testing sets, train the model on the training data, and evaluate its performance on the testing data.&lt;/p&gt;
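&lt;p&gt;A minimal sketch of such a baseline, under stated assumptions: synthetic stand-ins replace the preprocessed features and the log-transformed sale prices so the snippet is self-contained.&lt;/p&gt;

```python
# Random Forest baseline: split, train, and evaluate with RMSE.
# X and y are synthetic stand-ins for the preprocessed features and log sale prices.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=300)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_val, rf.predict(X_val)))
print(f"Validation RMSE: {rmse:.3f}")
```

&lt;p&gt;The held-out RMSE from a split like this gives a sanity-check baseline before moving on to the stacked ensemble below.&lt;/p&gt;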

&lt;h2&gt;
  
  
  Ensemble Learning Model
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Ensemble Learning Model
models = [search.best_estimator_ for search in [xgb_search, ridge_search, lasso_search, svr_search, lgbm_search, gbm_search]]  # list of best estimators from each model
ensemble_search = random_search(StackingCVRegressor(models, Ridge(), cv=kf), {"meta_regressor__alpha": np.logspace(-3, -2, 500)}, n_iter=20)  # Ensemble Stack
models.append(ensemble_search.best_estimator_)  # list of best estimators from each model including Stack

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Predicting Values &amp;amp; Submission
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Predicting Values &amp;amp; Submission
prediction = [i.predict(X_test) for i in models]  # Np array of Predictions
predictions = np.average(prediction, axis=0)  # average of all the values

# Convert the predictions into the given format, and finally convert them back to normal using the exponential function
my_prediction = pd.DataFrame({"Id": test_data.index, "SalePrice": np.exp(predictions)})  # given format
my_prediction.to_csv("E:\Education\Kaggle Projects\House Price - Advanced Regression/my_prediction_ensemble.csv", index=False)  # Saving to CSV

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Output Interpretation
&lt;/h2&gt;

&lt;p&gt;The cross-validated RMSE scores obtained during hyperparameter optimization indicate how well each model predicts housing prices from the given features. We can further analyze the importance scores the tree-based models assign to each explanatory variable to understand their impact on the prediction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Our data-driven exploration and modelling journey provides valuable insights into predicting housing prices. By leveraging advanced regression techniques and ensemble learning, we navigate through challenges, optimize models, and make predictions that contribute to the dynamic landscape of the housing market.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/drive/1y6OfVu2w2ZeGCz3UH8KW64WwAubKuV9Y?usp=sharing#scrollTo=zIpZnFly4TaE"&gt;Click here&lt;/a&gt; to view and run the code in Google Colab.&lt;/p&gt;

&lt;p&gt;Stay tuned for more data-driven adventures!&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A Dive into Predictive Modeling for Internet Usage Rates</title>
      <dc:creator>Tamal Barman</dc:creator>
      <pubDate>Tue, 05 Mar 2024 09:51:48 +0000</pubDate>
      <link>https://dev.to/tamalbarman/a-dive-into-predictive-modeling-for-internet-usage-rates-hif</link>
      <guid>https://dev.to/tamalbarman/a-dive-into-predictive-modeling-for-internet-usage-rates-hif</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the era of data-driven insights, machine learning stands at the forefront, revolutionizing our approach to complex problem-solving. In this blog post, we embark on a journey through the development of a predictive model that focuses on internet usage rates, employing a variety of techniques and leveraging the prowess of Python's data science ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Exploration
&lt;/h2&gt;

&lt;p&gt;The digital landscape's evolution has reshaped how we perceive and interact with the world. Our exploration begins with the goal of predicting internet usage rates, a critical metric reflecting societal connectivity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Loading and Preprocessing
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Loading the dataset
data = pd.read_csv("internet_usage_data.csv")

# Displaying the dataset
print(data.head())

# Data cleaning and preprocessing
# (include code snippets for handling missing values, converting variables, etc.)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building Predictive Models
&lt;/h2&gt;

&lt;p&gt;The core of our journey lies in creating predictive models to forecast internet usage rates. We employ both Random Forest and Extra Trees classifiers to achieve this goal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Random Forest Classifier
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Splitting the data into features and target variable
X = data.drop("internet_usage", axis=1)
y = data["internet_usage"]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Building the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Making predictions
rf_predictions = rf_classifier.predict(X_test)

# Evaluating the model
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_conf_matrix = confusion_matrix(y_test, rf_predictions)

print("Random Forest Classifier Results:")
print("Accuracy Score:", rf_accuracy)
print("Confusion Matrix:")
print(rf_conf_matrix)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Extra Trees Classifier
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Building the Extra Trees Classifier
et_classifier = ExtraTreesClassifier(n_estimators=100, random_state=42)
et_classifier.fit(X_train, y_train)

# Making predictions
et_predictions = et_classifier.predict(X_test)

# Evaluating the model
et_accuracy = accuracy_score(y_test, et_predictions)

print("Extra Trees Classifier Results:")
print("Accuracy Score:", et_accuracy)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Our journey concludes with the successful development and evaluation of predictive models for internet usage rates. Through the application of machine learning techniques and Python's data science ecosystem, we gain valuable insights into societal connectivity patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Directions
&lt;/h2&gt;

&lt;p&gt;As we look ahead, the potential applications of our models are vast. From informing policy decisions to guiding infrastructure development, the insights derived from internet usage predictions hold promise for driving positive societal change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Acknowledgements
&lt;/h2&gt;

&lt;p&gt;This project would not have been possible without the support and contributions of the open-source community, libraries like scikit-learn, and the wealth of knowledge shared by data science pioneers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore the Code Yourself!
&lt;/h2&gt;

&lt;p&gt;The beauty of open-source and collaborative learning is the ability to explore and experiment. If you're eager to dive into the code and run the models yourself, feel free to access the Google Colab file by following this &lt;a href="https://colab.research.google.com/drive/1gLA6rl5cu7NO-2HTpC135msJTDUdV19v?usp=sharing#scrollTo=PGm5itmtxiQO"&gt;link&lt;/a&gt;. The Colab file provides an interactive environment where you can tweak parameters, visualize results, and gain a hands-on understanding of the machine-learning process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Click on the provided Colab link.&lt;/li&gt;
&lt;li&gt;Once the Colab file opens, navigate through each code cell.&lt;/li&gt;
&lt;li&gt;Experiment with different parameters and observe how the model responds.&lt;/li&gt;
&lt;li&gt;Run the code to witness real-time results.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Share Your Insights
&lt;/h2&gt;

&lt;p&gt;Did you discover something interesting or have questions? Join the discussion by leaving comments in the Colab file. Your insights and queries contribute to the collaborative nature of data science.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Week 3. Running a Lasso Regression Analysis</title>
      <dc:creator>Tamal Barman</dc:creator>
      <pubDate>Fri, 10 Nov 2023 16:48:36 +0000</pubDate>
      <link>https://dev.to/tamalbarman/week-3-running-a-lasso-regression-analysis-33ji</link>
      <guid>https://dev.to/tamalbarman/week-3-running-a-lasso-regression-analysis-33ji</guid>
      <description>&lt;p&gt;&lt;code&gt;%matplotlib inline&lt;br&gt;
from pandas import Series, DataFrame&lt;br&gt;
import pandas as pd&lt;br&gt;
import numpy as np&lt;br&gt;
import os&lt;br&gt;
import matplotlib.pylab as plt&lt;br&gt;
from sklearn.cross_validation import train_test_split&lt;br&gt;
from sklearn.linear_model import LassoLarsCV&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;pd.set_option('display.float_format', lambda x: '%.3f' % x)&lt;br&gt;
# pd.set_option('display.mpl_style', 'default')  # deprecated&lt;br&gt;
plt.style.use('ggplot')  # Make the graphs a bit prettier&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;plt.rcParams['figure.figsize'] = (15, 5)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load the dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;loans = pd.read_csv("./LendingClub.csv", low_memory = False)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;LendingClub.csv is a dataset taken from The LendingClub (&lt;a href="https://www.lendingclub.com/"&gt;https://www.lendingclub.com/&lt;/a&gt;)&lt;br&gt;
which is a peer-to-peer leading company that directly connects borrowers and potential&lt;br&gt;
lenders/investors&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploring the target column&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The target (label) column of the dataset that we are interested in is called &lt;code&gt;bad_loans&lt;/code&gt;, where &lt;strong&gt;1&lt;/strong&gt; means a risky (bad) loan and &lt;strong&gt;0&lt;/strong&gt; means a safe loan. To make this more intuitive, we reassign the target so that &lt;strong&gt;1&lt;/strong&gt; means a safe loan and &lt;strong&gt;0&lt;/strong&gt; means a risky (bad) loan, and store the result in a new column called &lt;code&gt;safe_loans&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;loans['safe_loans'] = loans['bad_loans'].apply(lambda x : 1 if x == 0 else 0)&lt;br&gt;
loans.drop('bad_loans', axis = 1, inplace = True)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Select features to handle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;predictors = ['grade', # grade of the loan&lt;br&gt;
'sub_grade', # sub-grade of the loan&lt;br&gt;
'short_emp', # one year or less of employment&lt;br&gt;
'emp_length_num', # number of years of employment&lt;br&gt;
'home_ownership', # home ownership status: own, mortgage or rent&lt;br&gt;
'dti', # debt-to-income ratio&lt;br&gt;
'purpose', # the purpose of the loan&lt;br&gt;
'term', # the term of the loan&lt;br&gt;
'last_delinq_none', # whether the borrower has had a delinquency&lt;br&gt;
'last_major_derog_none', # whether the borrower has had a 90-day or worse rating&lt;br&gt;
'revol_util', # percent of available credit being used&lt;br&gt;
'total_rec_late_fee', # total late fees received to date&lt;br&gt;
]&lt;/code&gt;&lt;br&gt;
&lt;code&gt;target = 'safe_loans'&lt;/code&gt; # prediction target (y) (1 means safe, 0 is risky)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extract the predictors and target columns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;loans = loans[predictors + [target]]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delete rows where any or all of the data are missing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;data_clean = loans.dropna()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Convert categorical variables into binary variables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;data_clean = pd.get_dummies(data_clean, prefix_sep = '=')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;(data_clean.describe()).T&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;            count mean std min 25% 50% n
short emp 122607.0 0.123672 0.329208 0.0 0.00 0.00
emp length num 122607.0 6.370256 3.736014 0.0 3.00 6.00
dti 122607.0 15.496888 7.497442 0.0 9.88 15.26
last delinq none 122607.0 0.588115 0.492177 0.0 0.00 1.00
last major derog none 122607.0 0.873906 0.331957 0.0 1.00 1.00
revol util 122607.0 53.716307 25.723881 0.0 34.80 55.70
total rec late fee 122607.0 0.742344 5.363268 0.0 0.00 0.00
safe loans 122607.0 0.811185 0.391363 0.0 1.00 1.00
grade=A 122607.0 0.181996 0.385843 0.0 0.00 0.00
grade=B 122607.0 0.303180 0.459634 0.0 0.00 0.00
grade=C 122607.0 0.244276 0.429659 0.0 0.00 0.00
grade=D 122607.0 0.156394 0.363230 0.0 0.00 0.00
grade=E 122607.0 0.073324 0.260668 0.0 0.00 0.00
grade=F 122607.0 0.032070 0.176187 0.0 0.00 0.00
grade=G 122607.0 0.008760 0.093183 0.0 0.00 0.00
sub grade=A1 122607.0 0.024362 0.154172 0.0 0.00 0.00
sub grade=A2 122607.0 0.027339 0.163071 0.0 0.00 0.00
sub grade=A3 122607.0 0.032258 0.176684 0.0 0.00 0.00
sub grade=A4 122607.0 0.048880 0.215617 0.0 0.00 0.00
sub grade=A5 122607.0 0.049157 0.216197 0.0 0.00 0.00
sub grade=B1 122607.0 0.047607 0.212935 0.0 0.00 0.00
sub grade=B2 122607.0 0.057876 0.233510 0.0 0.00 0.00
sub grade=B3 122607.0 0.073699 0.261281 0.0 0.00 0.00
sub grade=B4 122607.0 0.067525 0.250930 0.0 0.00 0.00
sub grade=B5 122607.0 0.056473 0.230834 0.0 0.00 0.00
sub grade=C1 122607.0 0.057648 0.233077 0.0 0.00 0.00
sub grade=C2 122607.0 0.054858 0.227704 0.0 0.00 0.00
sub grade=C3 122607.0 0.046408 0.210369 0.0 0.00 0.00
sub grade=C4 122607.0 0.044059 0.205228 0.0 0.00 0.00
sub grade=C5 122607.0 0.041303 0.198990 0.0 0.00 0.00
... ... ... ... ... ... ...
sub grade=E4 122607.0 0.012895 0.112821 0.0 0.00 0.00
sub grade=E5 122607.0 0.011092 0.104735 0.0 0.00 0.00
sub grade=F1 122607.0 0.009013 0.094506 0.0 0.00 0.00
sub grade=F2 122607.0 0.007585 0.086763 0.0 0.00 0.00
sub grade=F3 122607.0 0.006280 0.078999 0.0 0.00 0.00
sub grade=F4 122607.0 0.005130 0.071442 0.0 0.00 0.00
sub grade=F5 122607.0 0.004062 0.063603 0.0 0.00 0.00
sub grade=G1 122607.0 0.003018 0.054852 0.0 0.00 0.00
sub grade=G2 122607.0 0.001966 0.044292 0.0 0.00 0.00
sub grade=G3 122607.0 0.001362 0.036881 0.0 0.00 0.00
sub grade=G4 122607.0 0.001240 0.035188 0.0 0.00 0.00
sub grade=G5 122607.0 0.001174 0.034251 0.0 0.00 0.00
home ownership=MORTGAGE 122607.0 0.483170 0.499719 0.0 0.00 0.00
home ownership=OTHER 122607.0 0.001460 0.038182 0.0 0.00 0.00
home ownership=OWN 122607.0 0.081097 0.272984 0.0 0.00 0.00
home ownership=RENT 122607.0 0.434274 0.495663 0.0 0.00 0.00
purpose=car 122607.0 0.019371 0.137825 0.0 0.00 0.00
purpose=credit card 122607.0 0.179843 0.384058 0.0 0.00 0.00
purpose=debt consolidation 122607.0 0.556518 0.496797 0.0 0.00 1.00
purpose=home improvement 122607.0 0.061522 0.240286 0.0 0.00 0.00
purpose=house 122607.0 0.008197 0.090165 0.0 0.00 0.00
purpose=major purchase 122607.0 0.031621 0.174991 0.0 0.00 0.00
purpose=medical 122607.0 0.013107 0.113733 0.0 0.00 0.00
purpose=moving 122607.0 0.009624 0.097630 0.0 0.00 0.00
purpose=other 122607.0 0.074115 0.261959 0.0 0.00 0.00
purpose=small business 122607.0 0.026622 0.160976 0.0 0.00 0.00
purpose=vacation 122607.0 0.007014 0.083457 0.0 0.00 0.00
purpose=wedding 122607.0 0.012446 0.110867 0.0 0.00 0.00
term= 36 months 122607.0 0.797679 0.401732 0.0 1.00 1.00
term= 60 months 122607.0 0.202321 0.401732 0.0 0.00 0.00
75% max
short emp 0.00 1.00
emp length num 11.00 11.00
dti 20.85 39.88
last delinq none 1.00 1.00
last major derog none 1.00 1.00
revol util 74.30 150.70
total rec late fee 0.00 208.82
safe loans 1.00 1.00
grade=A 0.00 1.00
grade=B 1.00 1.00
grade=C 0.00 1.00
grade=D 0.00 1.00
grade=E 0.00 1.00
grade=F 0.00 1.00
grade=G 0.00 1.00
sub grade=A1 0.00 1.00
sub grade=A2 0.00 1.00
sub grade=A3 0.00 1.00
sub grade=A4 0.00 1.00
sub grade=A5 0.00 1.00
sub grade=B1 0.00 1.00
sub grade=B2 0.00 1.00
sub grade=B3 0.00 1.00
sub grade=B4 0.00 1.00
sub grade=B5 0.00 1.00
sub grade=C1 0.00 1.00
sub grade=C2 0.00 1.00
sub grade=C3 0.00 1.00
sub grade=C4 0.00 1.00
sub grade=C5 0.00 1.00
... ... ...
sub grade=E4 0.00 1.00
sub grade=E5 0.00 1.00
sub grade=F1 0.00 1.00
sub grade=F2 0.00 1.00
sub grade=F3 0.00 1.00
sub grade=F4 0.00 1.00
sub grade=F5 0.00 1.00
sub grade=G1 0.00 1.00
sub grade=G2 0.00 1.00
sub grade=G3 0.00 1.00
sub grade=G4 0.00 1.00
sub grade=G5 0.00 1.00
home ownership=MORTGAGE 1.00 1.00
home ownership=OTHER 0.00 1.00
home ownership=OWN 0.00 1.00
home ownership=RENT 1.00 1.00
purpose=car 0.00 1.00
purpose=credit card 0.00 1.00
purpose=debt consolidation 1.00 1.00
purpose=home improvement 0.00 1.00
purpose=house 0.00 1.00
purpose=major purchase 0.00 1.00
purpose=medical 0.00 1.00
purpose=moving 0.00 1.00
purpose=other 0.00 1.00
purpose=small business 0.00 1.00
purpose=vacation 0.00 1.00
purpose=wedding 0.00 1.00
term= 36 months 1.00 1.00
term= 60 months 0.00 1.00
[68 rows x 8 columns]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Extract new features names&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;features = data_clean.columns.values&lt;br&gt;
features = features[features != target]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modeling and Prediction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;predvar = data_clean[features]&lt;br&gt;
predictors = predvar.copy()&lt;br&gt;
target = data_clean.safe_loans&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standardize predictors to have mean=0 and sd=1&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from sklearn import preprocessing&lt;br&gt;
for attr in predictors.columns.values:&lt;br&gt;
    predictors[attr] = preprocessing.scale(predictors[attr].astype('float64'))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Split into training and testing sets&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,&lt;br&gt;
test_size = .4,&lt;br&gt;
random_state = 123)&lt;br&gt;
print('pred_train.shape', pred_train.shape)&lt;br&gt;
print('pred_test.shape', pred_test.shape)&lt;br&gt;
print('tar_train.shape', tar_train.shape)&lt;br&gt;
print('tar_test.shape', tar_test.shape)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pred_train.shape (73564, 67)&lt;br&gt;
pred_test.shape (49043, 67)&lt;br&gt;
tar_train.shape (73564,)&lt;br&gt;
tar_test.shape (49043,)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specify the lasso regression model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;model = LassoLarsCV(cv = 10, precompute = False).fit(pred_train, tar_train)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Print variable names and regression coefficients&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pd.DataFrame([dict(zip(predictors.columns, model.coef_))], index=['coef']).T&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     coef
dti -0.031080
emp length num 0.000000
grade=A 0.036706
grade=B 0.015090
grade=C 0.000000
grade=D -0.014125
grade=E -0.017393
grade=F -0.015787
grade=G -0.009659
home ownership=MORTGAGE 0.010588
home ownership=OTHER -0.001163
home ownership=OWN 0.000000
home ownership=RENT -0.007503
last delinq none -0.008092
last major derog none -0.002272
purpose=car 0.000000
purpose=credit card 0.006216
purpose=debt consolidation 0.000000
purpose=home improvement -0.001184
purpose=house 0.000000
purpose=major purchase 0.000000
purpose=medical -0.003115
purpose=moving -0.000223
purpose=other -0.005091
purpose=small business -0.019715
purpose=vacation -0.000638
purpose=wedding 0.002051
revol util -0.012068
short emp -0.009104
sub grade=A1 0.002509
... ...
sub grade=B4 0.000000
sub grade=B5 -0.001996
sub grade=C1 0.001869
sub grade=C2 0.001481
sub grade=C3 0.000000
sub grade=C4 -0.001447
sub grade=C5 0.000000
sub grade=D1 0.000000
sub grade=D2 0.000000
sub grade=D3 -0.000327
sub grade=D4 -0.003696
sub grade=D5 -0.001762
sub grade=E1 0.004096
sub grade=E2 0.000000
sub grade=E3 -0.001249
sub grade=E4 -0.003977
sub grade=E5 -0.001715
sub grade=F1 0.000000
sub grade=F2 0.000000
sub grade=F3 -0.003955
sub grade=F4 -0.004832
sub grade=F5 -0.005503
sub grade=G1 -0.000474
sub grade=G2 -0.001004
sub grade=G3 0.000000
sub grade=G4 0.004561
sub grade=G5 0.000000
term= 36 months 0.026618
term= 60 months -0.005477
total rec late fee -0.044206
[67 rows x 1 columns]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
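&lt;p&gt;Notice how many coefficients in the table above are exactly zero: the L1 penalty performs variable selection. A small self-contained illustration of the same effect, using &lt;code&gt;LassoCV&lt;/code&gt; on synthetic data (not the loan model):&lt;/p&gt;

```python
# Sketch: lasso drives uninformative coefficients exactly to zero.
# Synthetic data -- only the first three of ten features drive the response.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(0, 0.1, size=500)

model = LassoCV(cv=10).fit(X, y)
selected = np.flatnonzero(model.coef_)  # indices of nonzero coefficients
print(selected)  # includes features 0, 1 and 2; most others shrink to exactly 0
```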



&lt;p&gt;&lt;strong&gt;Plot coefficient progression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;m_log_alphas = -np.log10(model.alphas_)&lt;br&gt;
ax = plt.gca()&lt;br&gt;
plt.plot(m_log_alphas, model.coef_path_.T)&lt;br&gt;
plt.axvline(-np.log10(model.alpha_), linestyle = '--', color = 'k',&lt;br&gt;
label = 'alpha CV')&lt;br&gt;
plt.ylabel('Regression Coefficients')&lt;br&gt;
plt.xlabel('-log(alpha)')&lt;br&gt;
plt.title('Regression Coefficients Progression for Lasso Paths')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;matplotlib.text.Text at 0x17aaab94390&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftg1nr8opg8gxhg3vtv8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftg1nr8opg8gxhg3vtv8j.png" alt="Image description" width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plot mean square error for each fold&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;m_log_alphascv = -np.log10(model.cv_alphas_)&lt;br&gt;
plt.figure()&lt;br&gt;
plt.plot(m_log_alphascv, model.mse_path_, ':')&lt;br&gt;
plt.plot(m_log_alphascv, model.mse_path_.mean(axis = -1), 'k',&lt;br&gt;
label = 'Average across the folds', linewidth = 2)&lt;br&gt;
plt.axvline(-np.log10(model.alpha_), linestyle = '--', color = 'k',&lt;br&gt;
label = 'alpha CV')&lt;br&gt;
plt.legend()&lt;br&gt;
plt.xlabel('-log(alpha)')&lt;br&gt;
plt.ylabel('Mean squared error')&lt;br&gt;
plt.title('Mean squared error on each fold')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;matplotlib.text.Text at 0x17aac05cef0&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmbqk9hfcqras5ocqf7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmbqk9hfcqras5ocqf7z.png" alt="Image description" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MSE from training and test data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from sklearn.metrics import mean_squared_error&lt;br&gt;
train_error = mean_squared_error(tar_train, model.predict(pred_train))&lt;br&gt;
test_error = mean_squared_error(tar_test, model.predict(pred_test))&lt;br&gt;
print('training data MSE')&lt;br&gt;
print(train_error)&lt;br&gt;
print('test data MSE')&lt;br&gt;
print(test_error)&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;training data MSE
0.141354906717
test data MSE
0.140656085708
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;R-square from training and test data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;rsquared_train = model.score(pred_train, tar_train)&lt;br&gt;
rsquared_test = model.score(pred_test, tar_test)&lt;br&gt;
print('training data R-square')&lt;br&gt;
print(rsquared_train)&lt;br&gt;
print('test data R-square')&lt;br&gt;
print(rsquared_test)&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;training data R-square
0.0799940399148
test data R-square
0.0772929635462
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>python</category>
      <category>analytics</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Week 2 Running a Random Forest</title>
      <dc:creator>Tamal Barman</dc:creator>
      <pubDate>Sat, 04 Nov 2023 13:02:51 +0000</pubDate>
      <link>https://dev.to/tamalbarman/week-2-running-a-random-forest-4kkc</link>
      <guid>https://dev.to/tamalbarman/week-2-running-a-random-forest-4kkc</guid>
      <description>&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;%matplotlib inline&lt;/code&gt;&lt;br&gt;
&lt;code&gt;from pandas import Series, DataFrame&lt;br&gt;
import pandas as pd&lt;br&gt;
import numpy as np&lt;br&gt;
import os&lt;br&gt;
import matplotlib.pyplot as plt&lt;br&gt;
from sklearn.model_selection import train_test_split&lt;br&gt;
from sklearn.tree import DecisionTreeClassifier&lt;br&gt;
from sklearn.metrics import classification_report&lt;br&gt;
import sklearn.metrics&lt;/code&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Feature Importance
&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;from sklearn import datasets&lt;br&gt;
from sklearn.ensemble import ExtraTreesClassifier&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pd.set_option('display.float_format', lambda x: '%.3f' % x)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;("C:/Users/Tamal/Documents/COURSES/Wesleyan University/Machine Learning for Data Analysis)&lt;/strong&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Machine Learning for Data Analysis
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Load the dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;loans = pd.read_csv("./LendingClub.csv", low_memory = False)&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;LendingClub.csv is a dataset taken from LendingClub (&lt;a href="https://www.lendingclub.com/"&gt;https://www.lendingclub.com/&lt;/a&gt;),&lt;br&gt;
a peer-to-peer lending company that directly connects borrowers with potential&lt;br&gt;
lenders/investors.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Exploring the target column
&lt;/h1&gt;

&lt;p&gt;The target column (label column) of the dataset that we are interested in is called &lt;code&gt;bad_loans&lt;/code&gt;. In this column, &lt;strong&gt;1&lt;/strong&gt; means a risky (bad) loan and &lt;strong&gt;0&lt;/strong&gt; means a safe loan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In order to make this more intuitive, we reassign the target to be:&lt;/strong&gt; &lt;strong&gt;+1&lt;/strong&gt; for a safe loan and &lt;strong&gt;-1&lt;/strong&gt; for a risky (bad) loan. We put this in a new column called &lt;code&gt;safe_loans&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)&lt;br&gt;
loans.drop('bad_loans', axis = 1, inplace = True)&lt;/code&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Select features to handle
&lt;/h1&gt;

&lt;p&gt;predictors = [&lt;code&gt;'grade',&lt;/code&gt; &lt;strong&gt;grade of the loan&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;'sub_grade',&lt;/code&gt; &lt;strong&gt;sub-grade of the loan&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;'short_emp',&lt;/code&gt; &lt;strong&gt;one year or less of employment&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;'emp_length_num',&lt;/code&gt; &lt;strong&gt;number of years of employment&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;'home_ownership',&lt;/code&gt; &lt;strong&gt;home ownership status: own, mortgage or rent&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;'dti',&lt;/code&gt; &lt;strong&gt;debt-to-income ratio&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;'purpose',&lt;/code&gt; &lt;strong&gt;the purpose of the loan&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;'term',&lt;/code&gt; &lt;strong&gt;the term of the loan&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;'last_delinq_none',&lt;/code&gt; &lt;strong&gt;whether the borrower has had a delinquency&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;'last_major_derog_none',&lt;/code&gt; &lt;strong&gt;whether the borrower has had a 90-day-or-worse rating&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;'revol_util',&lt;/code&gt; &lt;strong&gt;percent of available credit being used&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;'total_rec_late_fee',&lt;/code&gt; &lt;strong&gt;total late fees received to date&lt;/strong&gt;&lt;br&gt;
]&lt;br&gt;
target = &lt;code&gt;'safe_loans'&lt;/code&gt; &lt;strong&gt;prediction target (y) (+1 means safe, -1 means risky)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extract the predictors and target columns&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;loans = loans[predictors + [target]]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delete rows where any of the data are missing&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;data_clean = loans.dropna()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Convert categorical variables into binary variables&lt;br&gt;
&lt;em&gt;(Categorical features are not, yet, supported by sklearn DecisionTreeClassifier)&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;data_clean = pd.get_dummies(data_clean, prefix_sep = '=')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;print(data_clean.dtypes)&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(data_clean.describe()).T
short emp                                         int64
emp length num                                    int64
dti                                               float64
last delinq none                                  int64
last major derog none                             int64
revol util                                        float64
total rec late fee                                float64
safe loans                                        int64
grade=A                                           float64
grade=B                                           float64
grade=C                                           float64
grade=D                                           float64
grade=E                                           float64
grade=F                                           float64
grade=G                                           float64
sub grade=A1                                      float64
sub grade=A2                                      float64
sub grade=A3                                      float64
sub grade=A4                                      float64
sub grade=A5                                      float64
sub grade=B1                                      float64
sub grade=B2                                      float64
sub grade=B3                                      float64
sub grade=B4                                      float64
sub grade=B5                                      float64
sub grade=C1                                      float64
sub grade=C2                                      float64
sub grade=C3                                      float64
sub grade=C4                                      float64
sub grade=C5                                      float64
sub grade=E4                                      float64
sub grade=E5                                      float64
sub grade=F1                                      float64
sub grade=F2                                      float64
sub grade=F3                                      float64
sub grade=F4                                      float64
sub grade=F5                                      float64
sub grade=G1                                      float64
sub grade=G2                                      float64
sub grade=G3                                      float64
sub grade=G4                                      float64
sub grade=G5                                      float64
home ownership=MORTGAGE                           float64
home ownership=OTHER                              float64
home ownership=OWN                                float64
home ownership=RENT                               float64
purpose=car                                       float64
purpose=credit card                               float64
purpose=debt consolidation                        float64
purpose=home improvement                          float64
purpose=house                                     float64
purpose=major purchase                            float64
purpose=medical                                   float64
purpose=moving                                    float64
purpose=other                                     float64
purpose=small business                            float64
purpose=vacation                                  float64
purpose=wedding                                   float64
term= 36 months                                   float64
term= 60 months                                   float64
dtype: object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Out:&lt;/strong&gt; &lt;code&gt;(data_clean.describe()).T&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          count mean std min 25% 50%
short emp 122607.0 0.123672 0.329208 0.0 0.00 0.00
emp length num 122607.0 6.370256 3.736014 0.0 3.00 6.00
dti 122607.0 15.496888 7.497442 0.0 9.88 15.26
last delinq none 122607.0 0.588115 0.492177 0.0 0.00 1.00
last major derog none 122607.0 0.873906 0.331957 0.0 1.00 1.00
revol util 122607.0 53.716307 25.723881 0.0 34.80 55.70
total rec late fee 122607.0 0.742344 5.363268 0.0 0.00 0.00
safe loans 122607.0 0.622371 0.782726 -1.0 1.00 1.00
grade=A 122607.0 0.181996 0.385843 0.0 0.00 0.00
grade=B 122607.0 0.303180 0.459634 0.0 0.00 0.00
grade=C 122607.0 0.244276 0.429659 0.0 0.00 0.00
grade=D 122607.0 0.156394 0.363230 0.0 0.00 0.00
grade=E 122607.0 0.073324 0.260668 0.0 0.00 0.00
grade=F 122607.0 0.032070 0.176187 0.0 0.00 0.00
grade=G 122607.0 0.008760 0.093183 0.0 0.00 0.00
sub grade=A1 122607.0 0.024362 0.154172 0.0 0.00 0.00
sub grade=A2 122607.0 0.027339 0.163071 0.0 0.00 0.00
sub grade=A3 122607.0 0.032258 0.176684 0.0 0.00 0.00
sub grade=A4 122607.0 0.048880 0.215617 0.0 0.00 0.00
sub grade=A5 122607.0 0.049157 0.216197 0.0 0.00 0.00
sub grade=B1 122607.0 0.047607 0.212935 0.0 0.00 0.00
sub grade=B2 122607.0 0.057876 0.233510 0.0 0.00 0.00
sub grade=B3 122607.0 0.073699 0.261281 0.0 0.00 0.00
sub grade=B4 122607.0 0.067525 0.250930 0.0 0.00 0.00
sub grade=B5 122607.0 0.056473 0.230834 0.0 0.00 0.00
sub grade=C1 122607.0 0.057648 0.233077 0.0 0.00 0.00
sub grade=C2 122607.0 0.054858 0.227704 0.0 0.00 0.00
sub grade=C3 122607.0 0.046408 0.210369 0.0 0.00 0.00
sub grade=C4 122607.0 0.044059 0.205228 0.0 0.00 0.00
sub grade=C5 122607.0 0.041303 0.198990 0.0 0.00 0.00
sub grade=E4 122607.0 0.012895 0.112821 0.0 0.00 0.00
sub grade=E5 122607.0 0.011092 0.104735 0.0 0.00 0.00
sub grade=F1 122607.0 0.009013 0.094506 0.0 0.00 0.00
sub grade=F2 122607.0 0.007585 0.086763 0.0 0.00 0.00
sub grade=F3 122607.0 0.006280 0.078999 0.0 0.00 0.00
sub grade=F4 122607.0 0.005130 0.071442 0.0 0.00 0.00
sub grade=F5 122607.0 0.004062 0.063603 0.0 0.00 0.00
sub grade=G1 122607.0 0.003018 0.054852 0.0 0.00 0.00
sub grade=G2 122607.0 0.001966 0.044292 0.0 0.00 0.00
sub grade=G3 122607.0 0.001362 0.036881 0.0 0.00 0.00
sub grade=G4 122607.0 0.001240 0.035188 0.0 0.00 0.00
sub grade=G5 122607.0 0.001174 0.034251 0.0 0.00 0.00
home ownership=MORTGAGE 122607.0 0.483170 0.499719 0.0 0.00 0.00
home ownership=OTHER 122607.0 0.001460 0.038182 0.0 0.00 0.00
home ownership=OWN 122607.0 0.081097 0.272984 0.0 0.00 0.00
home ownership=RENT 122607.0 0.434274 0.495663 0.0 0.00 0.00
purpose=car 122607.0 0.019371 0.137825 0.0 0.00 0.00
purpose=credit card 122607.0 0.179843 0.384058 0.0 0.00 0.00
purpose=debt consolidation 122607.0 0.556518 0.496797 0.0 0.00 1.00
purpose=home improvement 122607.0 0.061522 0.240286 0.0 0.00 0.00
purpose=house 122607.0 0.008197 0.090165 0.0 0.00 0.00
purpose=major purchase 122607.0 0.031621 0.174991 0.0 0.00 0.00
purpose=medical 122607.0 0.013107 0.113733 0.0 0.00 0.00
purpose=moving 122607.0 0.009624 0.097630 0.0 0.00 0.00
purpose=other 122607.0 0.074115 0.261959 0.0 0.00 0.00
purpose=small business 122607.0 0.026622 0.160976 0.0 0.00 0.00
purpose=vacation 122607.0 0.007014 0.083457 0.0 0.00 0.00
purpose=wedding 122607.0 0.012446 0.110867 0.0 0.00 0.00
term= 36 months 122607.0 0.797679 0.401732 0.0 1.00 1.00
term= 60 months 122607.0 0.202321 0.401732 0.0 0.00 0.00
75% max
short emp 0.00 1.00
emp length num 11.00 11.00
dti 20.85 39.88
last delinq none 1.00 1.00
last major derog none 1.00 1.00
revol util 74.30 150.70
total rec late fee 0.00 208.82
safe loans 1.00 1.00
grade=A 0.00 1.00
grade=B 1.00 1.00
grade=C 0.00 1.00
grade=D 0.00 1.00
grade=E 0.00 1.00
grade=F 0.00 1.00
grade=G 0.00 1.00
sub grade=A1 0.00 1.00
sub grade=A2 0.00 1.00
sub grade=A3 0.00 1.00
sub grade=A4 0.00 1.00
sub grade=A5 0.00 1.00
sub grade=B1 0.00 1.00
sub grade=B2 0.00 1.00
sub grade=B3 0.00 1.00
sub grade=B4 0.00 1.00
sub grade=B5 0.00 1.00
sub grade=C1 0.00 1.00
sub grade=C2 0.00 1.00
sub grade=C3 0.00 1.00
sub grade=C4 0.00 1.00
sub grade=C5 0.00 1.00
sub grade=E4 0.00 1.00
sub grade=E5 0.00 1.00
sub grade=F1 0.00 1.00
sub grade=F2 0.00 1.00
sub grade=F3 0.00 1.00
sub grade=F4 0.00 1.00
sub grade=F5 0.00 1.00
sub grade=G1 0.00 1.00
sub grade=G2 0.00 1.00
sub grade=G3 0.00 1.00
sub grade=G4 0.00 1.00
sub grade=G5 0.00 1.00
home ownership=MORTGAGE 1.00 1.00
home ownership=OTHER 0.00 1.00
home ownership=OWN 0.00 1.00
home ownership=RENT 1.00 1.00
purpose=car 0.00 1.00
purpose=credit card 0.00 1.00
purpose=debt consolidation 1.00 1.00
purpose=home improvement 0.00 1.00
purpose=house 0.00 1.00
purpose=major purchase 0.00 1.00
purpose=medical 0.00 1.00
purpose=moving 0.00 1.00
purpose=other 0.00 1.00
purpose=small business 0.00 1.00
purpose=vacation 0.00 1.00
purpose=wedding 0.00 1.00
term= 36 months 1.00 1.00
term= 60 months 0.00 1.00
[68 rows x 8 columns]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Extract the new feature names&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;features = data_clean.columns.values&lt;br&gt;
features = features[features != target]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modeling and Prediction&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Split into training and testing sets&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;predictors = data_clean[features]&lt;br&gt;
targets = data_clean.safe_loans&lt;br&gt;
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets,&lt;br&gt;
test_size = .4)&lt;br&gt;
print('pred_train.shape', pred_train.shape)&lt;br&gt;
print('pred_test.shape', pred_test.shape)&lt;br&gt;
print('tar_train.shape', tar_train.shape)&lt;br&gt;
print('tar_test.shape', tar_test.shape)&lt;br&gt;
pred train.shape (73564, 67)&lt;br&gt;
pred test.shape (49043, 67)&lt;br&gt;
tar train.shape (73564,)&lt;br&gt;
tar test.shape (49043,)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build model on training data&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;from sklearn.ensemble import RandomForestClassifier&lt;br&gt;
classifier = RandomForestClassifier(n_estimators = 25)&lt;br&gt;
classifier = classifier.fit(pred_train, tar_train)&lt;br&gt;
predictions = classifier.predict(pred_test)&lt;br&gt;
conf_matrix = sklearn.metrics.confusion_matrix(tar_test, predictions)&lt;br&gt;
print(conf_matrix)&lt;br&gt;
[[ 1200 7957]&lt;br&gt;
[ 2068 37818]]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sklearn.metrics.accuracy_score(tar_test, predictions)&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Out:&lt;/strong&gt;&lt;br&gt;
 0.79558754562322864&lt;br&gt;
&lt;strong&gt;Fit an Extra Trees model to the data&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;model = ExtraTreesClassifier()&lt;br&gt;
model.fit(pred_train, tar_train)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Display the relative importance of each attribute&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;print(model.feature_importances_)&lt;br&gt;
[ 0.0116846 0.13741726 0.29752994 0.02180671 0.01403449 0.28940768&lt;br&gt;
0.03293531 0.00951938 0.0037876 0.00329254 0.00387515 0.00546364&lt;br&gt;
0.00417739 0.00154924 0.00084011 0.00069246 0.00081152 0.00096394&lt;br&gt;
0.00116574 0.00167741 0.00205909 0.00237266 0.00246947 0.00251691&lt;br&gt;
0.00305698 0.00325398 0.00302473 0.00332757 0.00274992 0.00220827&lt;br&gt;
0.00224134 0.00220883 0.00217347 0.00213623 0.00171983 0.00202893&lt;br&gt;
0.00204825 0.0020539 0.00177719 0.00121543 0.00119387 0.00116458&lt;br&gt;
0.00113021 0.00095636 0.00060222 0.00054441 0.00046878 0.00038406&lt;br&gt;
0.00040649 0.00745245 0.00060827 0.00606502 0.00744603 0.00355398&lt;br&gt;
0.00929081 0.01264373 0.00652372 0.00243849 0.00420151 0.0037273&lt;br&gt;
0.00321748 0.00838174 0.00601524 0.00236188 0.00309275 0.00558167&lt;br&gt;
0.01127186]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Show more important features&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;more_important_features = list()&lt;br&gt;
predictors_list = list(predictors.columns.values)&lt;br&gt;
idx = 0&lt;br&gt;
for imp in model.feature_importances_:&lt;br&gt;
if imp &amp;gt;= 0.1:&lt;br&gt;
more_important_features.append(predictors_list[idx])&lt;br&gt;
idx += 1&lt;br&gt;
print('More important features:', more_important_features)&lt;br&gt;
More important features: ['emp length num', 'dti', 'revol util']&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run different numbers of trees and see the effect on prediction accuracy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;trees = range(25)&lt;br&gt;
accuracy = np.zeros(25)&lt;br&gt;
for idx in range(len(trees)):&lt;br&gt;
classifier = RandomForestClassifier(n_estimators = idx + 1)&lt;br&gt;
classifier = classifier.fit(pred_train,tar_train)&lt;br&gt;
predictions = classifier.predict(pred_test)&lt;br&gt;
accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)&lt;br&gt;
plt.cla() # Clear axis&lt;br&gt;
plt.plot(trees, accuracy)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Out:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[&amp;lt;matplotlib.lines.Line2D at 0x18bccd1e3c8&amp;gt;]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf3ckbno3uc2v61nfero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf3ckbno3uc2v61nfero.png" alt="Image description" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Running a Random Forest Using Python</title>
      <dc:creator>Tamal Barman</dc:creator>
      <pubDate>Tue, 18 Jul 2023 04:29:15 +0000</pubDate>
      <link>https://dev.to/tamalbarman/running-a-random-forest-using-python-1920</link>
      <guid>https://dev.to/tamalbarman/running-a-random-forest-using-python-1920</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A Practical End-to-End Machine Learning Example&lt;/p&gt;

&lt;p&gt;There has never been a better time to delve into machine learning. The abundance of learning resources available online, coupled with free open-source tools offering implementations of a wide range of algorithms, and the affordable availability of computing power through cloud services like AWS, has truly democratized the field of machine learning. Now, anyone with a laptop and a willingness to learn can experiment with state-of-the-art algorithms within minutes. With a little more time and effort, you can develop practical models to assist you in your daily life or work, and even transition into the machine learning field to reap its economic benefits. In this post, I will guide you through an end-to-end implementation of the powerful random forest machine learning model. While it complements my conceptual explanation of random forests, it can also be understood independently as long as you grasp the basic idea of decision trees and random forests. I have provided the complete project, including the data, on GitHub, and you can download the data file and Jupyter Notebook from Google Drive. All you need is a laptop with Python installed and the ability to initiate a Jupyter Notebook to follow along. (For guidance on installing Python and running a Jupyter Notebook, refer to this guide.) Although Python code will be used, its purpose is not to intimidate but rather to demonstrate how accessible machine learning has become with the resources available today! While this project covers a few essential machine learning topics, I will strive to explain them clearly and provide additional learning resources for those who are interested.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F333nsm7iz5zsqndvk3yd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F333nsm7iz5zsqndvk3yd.jpg" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Introduction
&lt;/h2&gt;

&lt;p&gt;The problem we are addressing involves predicting tomorrow's maximum temperature in our city using one year of historical weather data. While I have chosen Seattle, WA as the city for this example, feel free to gather data for your own location using the NOAA Climate Data Online tool. Our goal is to make predictions without relying on existing weather forecasts, as it's more exciting to generate our own predictions. We have access to a year's worth of past maximum temperatures, as well as the temperatures from the previous two days and an estimate from a friend who claims to possess comprehensive weather knowledge. This is a supervised regression machine learning problem. It is considered supervised because we have both the features (data for the city) and the targets (temperature) that we want to predict. During the training process, we provide the random forest algorithm with both the features and targets, enabling it to learn how to map the data to a prediction. Furthermore, this task falls under regression since the target value is continuous, in contrast to discrete classes encountered in classification. With this background information established, let's dive into the implementation!&lt;/p&gt;

&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;p&gt;Before diving into programming, it's important to outline a concise guide to keep us focused. The following steps provide the foundation for any machine learning workflow once we have identified a problem and chosen a model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clearly state the question and determine the necessary data.&lt;/li&gt;
&lt;li&gt;Obtain the data in a format that is easily accessible.&lt;/li&gt;
&lt;li&gt;Identify and address any missing data points or anomalies as necessary.&lt;/li&gt;
&lt;li&gt;Prepare the data to be suitable for the machine learning model.&lt;/li&gt;
&lt;li&gt;Establish a baseline model that you aim to surpass.&lt;/li&gt;
&lt;li&gt;Train the model using the training data.&lt;/li&gt;
&lt;li&gt;Utilize the model to make predictions on the test data.&lt;/li&gt;
&lt;li&gt;Compare the model's predictions to the known targets in the test set and calculate performance metrics.&lt;/li&gt;
&lt;li&gt;If the model's performance is unsatisfactory, consider adjusting the model, acquiring more data, or trying a different modeling technique.&lt;/li&gt;
&lt;li&gt;Interpret the model's outcomes and report the results in both visual and numerical formats.&lt;/li&gt;
&lt;/ol&gt;
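
&lt;p&gt;The ten steps above can be sketched as a code skeleton. This is a hedged outline only: it uses synthetic stand-in data rather than the real weather CSV introduced later, and all numbers and names here are illustrative.&lt;/p&gt;

```python
# Sketch of the workflow: prepare data, set a baseline, train, predict, evaluate.
# Synthetic stand-in data; the post's real temps.csv comes later.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Steps 1-4: pose the question and prepare the data. Here: two lagged
# temperatures (like temp_1, temp_2) predicting tomorrow's maximum.
X = rng.normal(55, 10, size=(348, 2))
y = X.mean(axis=1) + rng.normal(0, 2, size=348)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Step 5: baseline -- always predict the training-set average temperature
baseline_mae = mean_absolute_error(
    y_test, np.full_like(y_test, y_train.mean()))

# Steps 6-8: train the model, predict on the test set, measure performance
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
model_mae = mean_absolute_error(y_test, model.predict(X_test))

print("baseline MAE:", round(baseline_mae, 2))
print("model MAE:", round(model_mae, 2))
```

&lt;p&gt;Steps 9 and 10 (adjusting the model and interpreting the results) follow once these numbers are in hand.&lt;/p&gt;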

&lt;h2&gt;
  
  
  Data Acquisition
&lt;/h2&gt;

&lt;p&gt;To begin, we require a dataset. For the purpose of a realistic example, I obtained weather data for Seattle, WA from the year 2016 using the NOAA Climate Data Online tool. Typically, approximately 80% of the time dedicated to data analysis involves cleaning and retrieving data. However, this workload can be minimized by identifying high-quality data sources. The NOAA tool proves to be remarkably user-friendly, enabling us to download temperature data in the form of clean CSV files that can be parsed using programming languages like Python or R. For those who wish to follow along, the complete data file is available for download.&lt;/p&gt;

&lt;p&gt;The following Python code loads the CSV data and displays the structure of the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Pandas is used for data manipulation
import pandas as pd
# Read in data and display first 5 rows
features = pd.read_csv('temps.csv')
features.head(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzwaot33ag4yf8neb7wy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzwaot33ag4yf8neb7wy.jpg" alt="Image description" width="742" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;year: The year, which is consistent at 2016 for all data points.&lt;/li&gt;
&lt;li&gt;month: The numerical representation of the month in the year.&lt;/li&gt;
&lt;li&gt;day: The numerical representation of the day in the year.&lt;/li&gt;
&lt;li&gt;week: The day of the week, expressed as a character string.&lt;/li&gt;
&lt;li&gt;temp_2: The maximum temperature recorded two days prior.&lt;/li&gt;
&lt;li&gt;temp_1: The maximum temperature recorded one day prior.&lt;/li&gt;
&lt;li&gt;average: The historical average maximum temperature.&lt;/li&gt;
&lt;li&gt;actual: The actual measured maximum temperature.&lt;/li&gt;
&lt;li&gt;friend: Your friend's prediction, which is a random number generated between 20 below the average and 20 above the average.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Identify Anomalies / Missing Data
&lt;/h2&gt;

&lt;p&gt;Upon examining the dimensions of the data, we observe that there are only 348 rows, which does not align with the expected 366 days in the year 2016. Upon closer inspection of the NOAA data, I discovered that several days were missing. This serves as a valuable reminder that real-world data collection is never flawless. Missing data, as well as incorrect data or outliers, can impact the analysis. However, in this case, the impact of the missing data is expected to be minimal, and the overall data quality is good due to the reliable source.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print('The shape of our features is:', features.shape)
The shape of our features is: (348, 9)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To identify anomalies, we can quickly compute summary statistics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Descriptive statistics for each column
features.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8rs5odjjoa2dw3wfh7i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8rs5odjjoa2dw3wfh7i.jpg" alt="Image description" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Upon initial inspection, there don't appear to be any data points that immediately stand out as anomalies, and there are no zeros in any of the measurement columns. Another effective method to assess data quality is by creating basic plots. Graphical representations often make it easier to identify anomalies compared to analyzing numerical values alone. I have omitted the actual code here for plotting since it may not be intuitive in Python. However, please feel free to refer to the notebook for the complete implementation. As a good practice, I must admit that I mostly leveraged existing plotting code from Stack Overflow, as many data scientists do.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ejq8s6c18cgasew0w7x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ejq8s6c18cgasew0w7x.jpg" alt="Image description" width="750" height="750"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Examining the quantitative statistics and the graphs, we can feel confident in the high quality of our data. There are no clear outliers, and although there are a few missing points, they will not detract from the analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Preparation
&lt;/h2&gt;

&lt;p&gt;However, we're not yet at a stage where we can directly input raw data into a model and expect it to provide accurate answers (although researchers are actively working on this!). We need to perform some preprocessing to make our data understandable by machine learning algorithms. For data manipulation, we will utilize the Python library Pandas, which provides a convenient data structure known as a dataframe, resembling an Excel spreadsheet with rows and columns.&lt;/p&gt;

&lt;p&gt;The specific steps for data preparation will vary based on the chosen model and the collected data. However, some level of data manipulation is typically necessary for any machine learning application.&lt;/p&gt;

&lt;p&gt;One important step in our case is known as one-hot encoding. This process converts categorical variables, such as days of the week, into a numerical representation without any arbitrary ordering. While we intuitively understand the concept of weekdays, machines lack this inherent knowledge. Computers primarily comprehend numbers, so it's crucial to accommodate them for machine learning purposes. Rather than simply mapping weekdays to numeric values from 1 to 7, which might introduce unintended bias due to the numerical order, we employ a technique called one-hot encoding. This transforms a single column representing weekdays into seven binary columns. Let me illustrate this visually:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fneyj2od8lvcjtd4qtjor.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fneyj2od8lvcjtd4qtjor.jpg" alt="Image description" width="123" height="261"&gt;&lt;/a&gt;&lt;br&gt;
and turns it into&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisxyrwwhlernhk4ql1pl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisxyrwwhlernhk4ql1pl.jpg" alt="Image description" width="284" height="245"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, if a data point is a Wednesday, it will have a 1 in the Wednesday column and a 0 in all other columns. This process can be done in pandas in a single line!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# One-hot encode the data using pandas get_dummies
features = pd.get_dummies(features)
# Display the first 5 rows of the columns from index 5 onward
features.iloc[:,5:].head(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Snapshot of data after one-hot encoding:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femjhfwjhs6xzuiz2ngpd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femjhfwjhs6xzuiz2ngpd.jpg" alt="Image description" width="750" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dimensions of our data have now become 348 x 15, with all columns consisting of numerical values, which is ideal for our algorithm!&lt;/p&gt;

&lt;p&gt;Next, we need to split the data into features and targets. The target, also known as the label, represents the value we want to predict, which in this case is the actual maximum temperature. The features encompass all the columns that the model will utilize to make predictions. Additionally, we will convert the Pandas dataframes into Numpy arrays, as that is the expected format for the algorithm. To retain the column headers, which correspond to the feature names, we will store them in a list for potential visualization purposes later on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Use numpy to convert to arrays
import numpy as np
# Labels are the values we want to predict
labels = np.array(features['actual'])
# Remove the labels from the features
# axis 1 refers to the columns
features= features.drop('actual', axis = 1)
# Saving feature names for later use
feature_list = list(features.columns)
# Convert to numpy array
features = np.array(features)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step in data preparation involves splitting the data into training and testing sets. During the training phase, we expose the model to the answers (in this case, the actual temperatures) so that it can learn how to predict temperatures based on the given features. We anticipate a relationship between the features and the target value, and the model's task is to learn this relationship during training. When it comes to evaluating the model's performance, we ask it to make predictions on a separate testing set where it only has access to the features (without the answers). Since we have the actual answers for the test set, we can compare the model's predictions against the true values to assess its accuracy.&lt;/p&gt;

&lt;p&gt;Typically, when training a model, we randomly split the data into training and testing sets to ensure a representative sample of all data points. If we were to train the model solely on the data from the first nine months of the year and then use the final three months for prediction, the model's performance would be suboptimal because it hasn't encountered any data from those last three months. In this case, I am setting the random state to 42, which ensures that the results of the split remain consistent across multiple runs, thus enabling reproducible results.&lt;br&gt;
The following code splits the data sets with another single line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Using Skicit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can check the shape of all the data to make sure everything was done correctly. We expect the number of columns in the training features to match the number of columns in the testing features, and the number of rows in each features array to match the number of rows in the corresponding labels array:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)
Training Features Shape: (261, 14)
Training Labels Shape: (261,)
Testing Features Shape: (87, 14)
Testing Labels Shape: (87,)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It seems that everything is in order! Let's recap the steps we took to prepare the data for machine learning:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;One-hot encoded categorical variables.&lt;/li&gt;
&lt;li&gt;Split the data into features and labels.&lt;/li&gt;
&lt;li&gt;Converted the data into arrays.&lt;/li&gt;
&lt;li&gt;Split the data into training and testing sets.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Depending on the initial dataset, there may be additional tasks involved, such as handling outliers, imputing missing values, or transforming temporal variables into cyclical representations. These steps may seem arbitrary at first, but once you grasp the basic workflow, you'll find it remains largely consistent across machine learning problems. Ultimately, the goal is to convert human-readable data into a format a machine learning model can understand.&lt;/p&gt;
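&lt;p&gt;As a recap in code, the four steps can be condensed into one small helper. This is a sketch rather than the exact notebook code; the 'actual' column name follows the article, and the toy dataframe used to exercise it is invented for illustration:&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def prepare(df, target='actual'):
    df = pd.get_dummies(df)                  # 1. one-hot encode categoricals
    labels = np.array(df[target])            # 2. split off the target column
    feature_df = df.drop(target, axis=1)
    feature_list = list(feature_df.columns)  # keep names for later plots
    features = np.array(feature_df)          # 3. convert to numpy arrays
    # 4. split into training and testing sets
    splits = train_test_split(features, labels, test_size=0.25, random_state=42)
    return splits + [feature_list]
```

&lt;p&gt;Calling this helper on the weather dataframe would reproduce the arrays created above in a single step.&lt;/p&gt;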

&lt;h2&gt;
  
  
  Establish Baseline
&lt;/h2&gt;

&lt;p&gt;Prior to making and evaluating predictions, it is essential to establish a baseline—a reasonable benchmark that we aim to surpass with our model. If our model fails to improve upon the baseline, it indicates that either we should explore alternative models or acknowledge that machine learning may not be suitable for our specific problem. In our case, the baseline prediction can be derived from the historical average maximum temperatures. Put simply, our baseline represents the error we would incur if we were to predict the average maximum temperature for all days.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The baseline predictions are the historical averages
baseline_preds = test_features[:, feature_list.index('average')]
# Baseline errors, and display average baseline error
baseline_errors = abs(baseline_preds - test_labels)
print('Average baseline error: ', round(np.mean(baseline_errors), 2), 'degrees.')
Average baseline error:  5.06 degrees.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now have our goal! If we can’t beat an average error of 5 degrees, then we need to rethink our approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Train Model
&lt;/h2&gt;

&lt;p&gt;After completing the data preparation steps, the process of creating and training the model becomes relatively straightforward using Scikit-learn. We can accomplish this by importing the random forest regression model from Scikit-learn, initializing an instance of the model, and fitting (Scikit-learn's term for training) the model with the training data. To ensure reproducible results, we can set the random state. Remarkably, this entire process can be achieved in just three lines of code in Scikit-learn!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(train_features, train_labels);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Make Predictions on the Test Set
&lt;/h2&gt;

&lt;p&gt;Now that our model has been trained to learn the relationships between the features and targets, the next step is to evaluate its performance. To achieve this, we need to make predictions on the test features (ensuring the model does not have access to the test answers). Subsequently, we compare these predictions to the known answers. In regression tasks, it is crucial to use the absolute error metric, as we anticipate a range of both low and high values in our predictions. We are primarily interested in quantifying the average difference between our predictions and the actual values, hence the use of absolute error (as we did when establishing the baseline).&lt;/p&gt;

&lt;p&gt;In Scikit-learn, making predictions with our model is as simple as executing a single line of code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Use the forest's predict method on the test data
predictions = rf.predict(test_features)
# Calculate the absolute errors
errors = abs(predictions - test_labels)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')
Mean Absolute Error: 3.83 degrees.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our average estimate is off by 3.83 degrees. That is more than a 1 degree average improvement over the baseline. Although this might not seem significant, it is nearly 25% better than the baseline, which, depending on the field and the problem, could represent millions of dollars to a company.&lt;/p&gt;

&lt;h2&gt;
  
  
  Determine Performance Metrics
&lt;/h2&gt;

&lt;p&gt;To put our predictions in perspective, we can calculate an accuracy as 100% minus the mean absolute percentage error (MAPE).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)
# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
Accuracy: 93.99 %.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That looks pretty good! Our model has learned how to predict the maximum temperature for the next day in Seattle with 94% accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improve Model if Necessary
&lt;/h2&gt;

&lt;p&gt;At this point in a typical machine learning workflow, we would move on to hyperparameter tuning. This process involves adjusting the settings of the model to enhance its performance. These settings are known as hyperparameters, distinguishing them from the model parameters learned during training. The most common approach to hyperparameter tuning involves creating multiple models with different settings, evaluating them all on the same validation set, and determining which configuration yields the best performance. Manually conducting this process would be laborious, so Scikit-learn provides automated methods to simplify the task. It's important to note that hyperparameter tuning is often more an engineering practice than a theory-driven one, and I encourage those interested to explore the documentation and begin experimenting. An accuracy of 94% is satisfactory for this problem, but keep in mind that the first model built is rarely the one that makes it into production; model improvement is an iterative process.&lt;/p&gt;
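&lt;p&gt;As a taste of what automated tuning looks like, here is a minimal sketch using Scikit-learn's RandomizedSearchCV on synthetic data. The grid values are illustrative assumptions, not tuned recommendations for this dataset:&lt;/p&gt;

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the weather data
X, y = make_regression(n_samples=200, n_features=8, noise=10, random_state=42)

# Illustrative hyperparameter grid (values chosen for the example)
param_distributions = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=5,                             # evaluate 5 random settings
    cv=3,                                 # 3-fold cross-validation
    scoring='neg_mean_absolute_error',    # same metric as our evaluation
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```

&lt;p&gt;The best settings found on the validation folds would then be used to refit the final model on the full training set.&lt;/p&gt;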

&lt;h2&gt;
  
  
  Interpret Model and Report Results
&lt;/h2&gt;

&lt;p&gt;At this point, we know our model is good, but it’s pretty much a black box. We feed in some Numpy arrays for training, ask it to make a prediction, evaluate the predictions, and see that they are reasonable. The question is: how does this model arrive at the values? There are two approaches to get under the hood of the random forest: first, we can look at a single tree in the forest, and second, we can look at the feature importances of our explanatory variables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualizing a Single Decision Tree
&lt;/h2&gt;

&lt;p&gt;One of the coolest parts of the random forest implementation in Scikit-learn is that we can actually examine any of the trees in the forest. We will select one tree and save the whole tree as an image.&lt;/p&gt;

&lt;p&gt;The following code takes one tree from the forest and saves it as an image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Pull out one tree from the forest
tree = rf.estimators_[5]
# Export the image to a dot file
export_graphviz(tree, out_file = 'tree.dot', feature_names = feature_list, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write graph to a png file
graph.write_png('tree.png')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s take a look:&lt;br&gt;
&lt;a href="https://drive.google.com/file/d/138hewRWvyijnIuFyo21nbff-KbHci_4N/view"&gt;&lt;/a&gt;&lt;br&gt;
Wow! That looks like quite an expansive tree with 15 layers (in reality this is quite a small tree compared to some I’ve seen). You can download this image yourself and examine it in greater detail, but to make things easier, I will limit the depth of trees in the forest to produce an understandable image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Limit depth of tree to 3 levels
rf_small = RandomForestRegressor(n_estimators=10, max_depth = 3)
rf_small.fit(train_features, train_labels)
# Extract the small tree
tree_small = rf_small.estimators_[5]
# Save the tree as a png image
export_graphviz(tree_small, out_file = 'small_tree.dot', feature_names = feature_list, rounded = True, precision = 1)
(graph, ) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the reduced-size tree, annotated with labels:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4abp7oo256uhi8j58t1l.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4abp7oo256uhi8j58t1l.jpg" alt="Image description" width="750" height="456"&gt;&lt;/a&gt;&lt;br&gt;
Based solely on this decision tree, we can make predictions for new data points. Let's consider an example of predicting the maximum temperature for Wednesday, December 27, 2017, with the following values: temp_2 = 39, temp_1 = 35, average = 44, and friend = 30.&lt;/p&gt;

&lt;p&gt;Starting at the root node, we encounter the first question, where the answer is True because temp_1 ≤ 59.5. We proceed to the left and come across the second question, which is also True since average ≤ 46.8. Continuing to the left, we reach the third and final question, which is again True because temp_1 ≤ 44.5. As a result, we conclude that our estimate for the maximum temperature is 41.0 degrees, as indicated by the value in the leaf node.&lt;/p&gt;
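&lt;p&gt;We don't have to trace these questions by hand: Scikit-learn can report the exact path a sample takes through a tree. The sketch below builds a small tree on synthetic data; in the notebook, the same calls could be made on tree_small with a row of test features:&lt;/p&gt;

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Small synthetic regression problem standing in for the weather data
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = X[:, 0] * 10 + rng.normal(size=100)

tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X, y)

sample = X[:1]                      # one row to trace
path = tree.decision_path(sample)   # sparse matrix of visited nodes
visited = path.indices              # node ids from root to leaf
leaf = tree.apply(sample)[0]        # id of the leaf the sample lands in
print(visited, leaf)
```

&lt;p&gt;The last visited node is the leaf whose stored value becomes the prediction, exactly as in the manual walk above.&lt;/p&gt;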

&lt;p&gt;An interesting observation is that the root node only contains 162 samples, despite there being 261 training data points. This is because each tree in the random forest is trained on a random subset of the data points with replacement, a technique known as bagging (bootstrap aggregating). If we want to use all the data points without sampling with replacement, we can disable it by setting bootstrap = False when constructing the forest. The combination of random sampling of data points and a subset of features at each node is why the model is referred to as a "random" forest.&lt;/p&gt;
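&lt;p&gt;The sampling-with-replacement step is easy to illustrate on its own. In the toy sketch below, a bootstrap sample of the 261 training rows typically contains only around 63% of them as unique rows, which is why an individual tree's root node sees fewer samples than the full training set:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)
n = 261  # number of training points in our split

# Draw a bootstrap sample: n row indices, chosen with replacement
bootstrap = rng.choice(n, size=n, replace=True)

# Count how many distinct rows the sample actually contains
unique_rows = np.unique(bootstrap).size
print(unique_rows, n)
```

&lt;p&gt;The rows left out of a given tree's sample are its "out-of-bag" points, which random forests can also use for an internal error estimate.&lt;/p&gt;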

&lt;p&gt;Furthermore, it is worth noting that in our decision tree, we only utilized two variables to make predictions. According to this specific tree, the remaining features such as the month of the year, day of the month, and our friend's prediction are deemed irrelevant for predicting tomorrow's maximum temperature. Our tree's visual representation has increased our understanding of the problem domain, enabling us to discern which data to consider when making predictions.&lt;/p&gt;
&lt;h2&gt;
  
  
  Variable Importances
&lt;/h2&gt;

&lt;p&gt;To assess the significance of all the variables within the random forest, we can examine their relative importances. The importances, obtained from Scikit-learn, indicate how much including a particular variable enhances the prediction. While the precise calculation of importance is beyond the scope of this post, we can utilize these values to make relative comparisons between variables.&lt;/p&gt;

&lt;p&gt;The provided code leverages several useful techniques in the Python language, including list comprehensions, zip, sorting, and argument unpacking. While comprehending these techniques is not crucial at the moment, they are valuable tools to have in your Python repertoire if you aspire to enhance your proficiency with the language.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Get numerical feature importances
importances = list(rf.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
Variable: temp_1               Importance: 0.7
Variable: average              Importance: 0.19
Variable: day                  Importance: 0.03
Variable: temp_2               Importance: 0.02
Variable: friend               Importance: 0.02
Variable: month                Importance: 0.01
Variable: year                 Importance: 0.0
Variable: week_Fri             Importance: 0.0
Variable: week_Mon             Importance: 0.0
Variable: week_Sat             Importance: 0.0
Variable: week_Sun             Importance: 0.0
Variable: week_Thurs           Importance: 0.0
Variable: week_Tues            Importance: 0.0
Variable: week_Wed             Importance: 0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the top of the importance list is "temp_1," the maximum temperature of the day before. This finding confirms that the best predictor of the maximum temperature for a given day is the maximum temperature recorded on the previous day, which aligns with our intuition. The second most influential factor is the historical average maximum temperature, which is also a logical result. Surprisingly, your friend's prediction, along with variables such as the day of the week, year, month, and temperature two days prior, appear to be unhelpful in predicting the maximum temperature. These importances make sense, as we wouldn't expect the day of the week to have any bearing on the weather. Additionally, the year remains the same for all data points, rendering it useless for predicting the maximum temperature.&lt;/p&gt;

&lt;p&gt;In future implementations of the model, we can exclude these variables with negligible importance, and the performance will not suffer. Moreover, if we were to employ a different model, such as a support vector machine, we could utilize the random forest feature importances as a form of feature selection. To demonstrate this, we can swiftly construct a random forest using only the two most significant variables—the maximum temperature one day prior and the historical average—and compare its performance to the original model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# New random forest with only the two most important variables
rf_most_important = RandomForestRegressor(n_estimators= 1000, random_state=42)
# Extract the two most important features
important_indices = [feature_list.index('temp_1'), feature_list.index('average')]
train_important = train_features[:, important_indices]
test_important = test_features[:, important_indices]
# Train the random forest
rf_most_important.fit(train_important, train_labels)
# Make predictions and determine the error
predictions = rf_most_important.predict(test_important)
errors = abs(predictions - test_labels)
# Display the performance metrics
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')
mape = np.mean(100 * (errors / test_labels))
accuracy = 100 - mape
print('Accuracy:', round(accuracy, 2), '%.')
Mean Absolute Error: 3.9 degrees.
Accuracy: 93.8 %.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This insight highlights that we do not necessarily require all the collected data to make accurate predictions. In fact, if we were to continue using this model, we could narrow down our data collection to just the two most significant variables and achieve nearly the same level of performance. However, in a production setting, we would need to consider the trade-off between decreased accuracy and the additional time and resources required to gather more information. Striking the right balance between performance and cost is a vital skill for a machine learning engineer and will ultimately depend on the specific problem at hand.&lt;/p&gt;

&lt;p&gt;At this stage, we have covered the fundamentals of implementing a random forest model for a supervised regression problem. We can be confident that our model can predict the maximum temperature for tomorrow with 94% accuracy, leveraging one year of historical data. From here, feel free to experiment with this example or apply the model to a dataset of your choice. To conclude, I will delve into a few visualizations. As a data scientist, I find great joy in creating graphs and models, and visualizations not only provide aesthetic pleasure but also assist us in diagnosing our model by condensing a wealth of numerical information into easily comprehensible images.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualizations
&lt;/h2&gt;

&lt;p&gt;To visualize the discrepancies in the relative importance of the variables, I will create a straightforward bar plot of the feature importances. Plotting in Python can be a bit unintuitive, and I often find myself searching for solutions on Stack Overflow when creating graphs. Don't worry if the code provided doesn't fully make sense—sometimes, understanding every line of code isn't essential to achieve the desired outcome!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import matplotlib for plotting and use magic command for Jupyter Notebooks
import matplotlib.pyplot as plt
%matplotlib inline
# Set the style
plt.style.use('fivethirtyeight')
# list of x locations for plotting
x_values = list(range(len(importances)))
# Make a bar chart
plt.bar(x_values, importances, orientation = 'vertical')
# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')
# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rhslt69hzjoetmgv7l5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rhslt69hzjoetmgv7l5.jpg" alt="Image description" width="437" height="377"&gt;&lt;/a&gt;&lt;br&gt;
Next, we can plot the entire dataset with predictions highlighted. This requires a little data manipulation, but it's not too difficult. We can use this plot to determine whether there are any outliers in either the data or our predictions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Use datetime for creating date objects for plotting
import datetime
# Dates of training values
months = features[:, feature_list.index('month')]
days = features[:, feature_list.index('day')]
years = features[:, feature_list.index('year')]
# List and then convert to datetime object
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]
# Dataframe with true values and dates
true_data = pd.DataFrame(data = {'date': dates, 'actual': labels})
# Dates of predictions
months = test_features[:, feature_list.index('month')]
days = test_features[:, feature_list.index('day')]
years = test_features[:, feature_list.index('year')]
# Column of dates
test_dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
# Convert to datetime objects
test_dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in test_dates]
# Dataframe with predictions and dates
predictions_data = pd.DataFrame(data = {'date': test_dates, 'prediction': predictions})
# Plot the actual values
plt.plot(true_data['date'], true_data['actual'], 'b-', label = 'actual')
# Plot the predicted values
plt.plot(predictions_data['date'], predictions_data['prediction'], 'ro', label = 'prediction')
plt.xticks(rotation = '60'); 
plt.legend()
# Graph labels
plt.xlabel('Date'); plt.ylabel('Maximum Temperature (F)'); plt.title('Actual and Predicted Values');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flypjyzowulima5qdaszd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flypjyzowulima5qdaszd.jpg" alt="Image description" width="437" height="351"&gt;&lt;/a&gt;&lt;br&gt;
Creating a visually appealing graph does require a bit of effort, but the end result is worth it! From the data, it appears that we don't have any noticeable outliers that need to be addressed. To gain further insights into the model's performance, we can plot the residuals (i.e., the errors) to determine if the model tends to over-predict or under-predict. Additionally, examining the distribution of residuals can help assess if they follow a normal distribution. However, for the purpose of this final chart, I will focus on visualizing the actual values, the temperature one day prior, the historical average, and our friend's prediction. This visualization will help us discern the difference between useful variables and those that provide less valuable information.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Make the data accessible for plotting
true_data['temp_1'] = features[:, feature_list.index('temp_1')]
true_data['average'] = features[:, feature_list.index('average')]
true_data['friend'] = features[:, feature_list.index('friend')]
# Plot all the data as lines
plt.plot(true_data['date'], true_data['actual'], 'b-', label  = 'actual', alpha = 1.0)
plt.plot(true_data['date'], true_data['temp_1'], 'y-', label  = 'temp_1', alpha = 1.0)
plt.plot(true_data['date'], true_data['average'], 'k-', label = 'average', alpha = 0.8)
plt.plot(true_data['date'], true_data['friend'], 'r-', label = 'friend', alpha = 0.3)
# Formatting plot
plt.legend(); plt.xticks(rotation = '60');
# Labels and title
plt.xlabel('Date'); plt.ylabel('Maximum Temperature (F)'); plt.title('Actual Max Temp and Variables');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xd1x5c1bunwgb5791j8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xd1x5c1bunwgb5791j8.jpg" alt="Image description" width="437" height="351"&gt;&lt;/a&gt;&lt;br&gt;
The lines on the chart may appear a bit crowded, but we can still observe why the maximum temperature one day prior and the historical average maximum temperature are valuable for predicting the maximum temperature. Conversely, it's evident that our friend's prediction does not provide significant predictive power (but let's not completely dismiss our friend's input, although we should exercise caution in relying heavily on their estimate). Creating graphs like this in advance can assist us in selecting the appropriate variables to include in our model, and they also serve as valuable diagnostic tools. Just as in Anscombe's quartet, graphs often reveal insights that quantitative numbers alone may overlook. Including visualizations as part of any machine learning workflow is highly recommended.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;With the inclusion of these graphs, we have successfully completed an end-to-end machine learning example! To further enhance our model, we can explore different hyperparameters, experiment with alternative algorithms, or, perhaps most effectively, gather more data. The performance of any model is directly influenced by the quantity and quality of the data it learns from, and our training data was relatively limited. I encourage everyone to continue refining this model and share their findings. Furthermore, for those interested in delving deeper into the theory and practical application of random forests, there are numerous free online resources available. If you're seeking a comprehensive book that covers both theory and Python implementations of machine learning models, I highly recommend "Hands-On Machine Learning with Scikit-Learn and TensorFlow." Lastly, I hope that those who have followed along with this example have realized the accessibility of machine learning and are motivated to join the inclusive and supportive machine learning community.&lt;/p&gt;

</description>
      <category>python</category>
      <category>coursera</category>
      <category>programming</category>
    </item>
    <item>
      <title>What is Lasso Regression?</title>
      <dc:creator>Tamal Barman</dc:creator>
      <pubDate>Sat, 06 May 2023 04:20:38 +0000</pubDate>
      <link>https://dev.to/tamalbarman/what-is-lasso-regression-209a</link>
      <guid>https://dev.to/tamalbarman/what-is-lasso-regression-209a</guid>
      <description>&lt;p&gt;Lasso regression, also known as L1 regularization, is a linear regression technique that adds a penalty term to the ordinary least squares (OLS) cost function. The penalty term is the absolute value of the regression coefficients multiplied by a tuning parameter, lambda. The purpose of this penalty term is to shrink the coefficients towards zero, which helps to reduce overfitting and select a subset of the most important features.&lt;/p&gt;

&lt;p&gt;Lasso regression can be used for feature selection, as it tends to set the coefficients of less important features to zero. This is because the penalty term encourages sparsity in the coefficient estimates, meaning that it favors solutions where many of the coefficients are exactly zero. As a result, Lasso regression is particularly useful when working with high-dimensional data sets where there may be many irrelevant features.&lt;/p&gt;

&lt;p&gt;Compared to Ridge regression, which uses a penalty term based on the square of the regression coefficients, Lasso regression is more likely to result in a sparse model with only a subset of the features having non-zero coefficients. However, Lasso regression may not perform as well as Ridge regression in situations where all the features are relevant, as it can be more prone to producing biased estimates of the coefficients.&lt;/p&gt;
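&lt;p&gt;The contrast is easy to demonstrate on synthetic data. The sketch below (using scikit-learn, with an alpha value chosen purely for illustration) fits both penalties to a dataset in which only a few features are informative and counts the coefficients driven exactly to zero:&lt;/p&gt;

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 of which actually influence y
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Same penalty strength for both models (illustrative choice)
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

lasso_zeros = int(np.sum(lasso.coef_ == 0.0))
ridge_zeros = int(np.sum(ridge.coef_ == 0.0))

# Lasso zeroes out irrelevant coefficients; Ridge only shrinks them
print(f"Lasso coefficients exactly zero: {lasso_zeros} of 20")
print(f"Ridge coefficients exactly zero: {ridge_zeros} of 20")
```

&lt;p&gt;With this setup, Lasso typically eliminates most of the 15 uninformative coefficients outright, while Ridge leaves all 20 non-zero.&lt;/p&gt;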

&lt;h2&gt;
  
  
  Lasso Meaning:
&lt;/h2&gt;

&lt;p&gt;In machine learning, Lasso or Lasso regression is a technique used for linear regression that adds a penalty term to the cost function. This penalty term is the absolute value of the sum of the regression coefficients multiplied by a tuning parameter, which is typically set through cross-validation. The purpose of this penalty term is to shrink the coefficients towards zero, thereby reducing the effect of less important features and selecting a subset of the most relevant features for the model.&lt;br&gt;
The word “LASSO” stands for Least Absolute Shrinkage and Selection Operator.&lt;/p&gt;

&lt;h2&gt;
  
  
  Regularization:
&lt;/h2&gt;

&lt;p&gt;Lasso regularization is particularly useful for feature selection in high-dimensional data sets, where there are many irrelevant features. It tends to set the coefficients of less important features to zero, which makes the resulting model simpler and easier to interpret. Compared to Ridge regression, which uses a penalty term based on the square of the regression coefficients, Lasso regression is more likely to produce sparse models with fewer non-zero coefficients.&lt;br&gt;
The purpose of regularization in Lasso is to prevent overfitting, which occurs when the model fits the training data too closely and performs poorly on new data. Regularization achieves this by shrinking the coefficients towards zero, which reduces the effect of less important features and helps to select a subset of the most important features.&lt;/p&gt;
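&lt;p&gt;The strength of this shrinkage is controlled by the tuning parameter. The sketch below, on synthetic data with illustrative parameter values, shows that increasing scikit-learn's &lt;code&gt;alpha&lt;/code&gt; (the lambda above) forces more coefficients to exactly zero:&lt;/p&gt;

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 5 informative (illustrative setup)
X, y = make_regression(n_samples=150, n_features=20, n_informative=5,
                       noise=10.0, random_state=1)

# Stronger regularization forces more coefficients to exactly zero
nonzero_counts = []
for alpha in [0.01, 1.0, 10.0, 100.0]:
    model = Lasso(alpha=alpha, max_iter=50000).fit(X, y)
    nonzero_counts.append(int(np.count_nonzero(model.coef_)))
    print(f"alpha={alpha:6.2f}: {nonzero_counts[-1]} non-zero coefficients")
```

&lt;p&gt;In practice, alpha is not picked by hand like this but chosen via cross-validation, as in the example later in this post.&lt;/p&gt;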

&lt;h2&gt;
  
  
  Lasso Regularization Technique:
&lt;/h2&gt;

&lt;p&gt;Lasso regularization is a technique used in linear regression to prevent overfitting and improve the predictive performance of the model. It achieves this by adding a penalty term to the ordinary least squares (OLS) cost function, which is a linear combination of the sum of squared errors and the sum of the absolute values of the regression coefficients.&lt;/p&gt;

&lt;p&gt;The Lasso penalty term is given by the formula:&lt;/p&gt;

&lt;p&gt;lambda * (|b1| + |b2| + ... + |bp|)&lt;/p&gt;

&lt;p&gt;where lambda is a tuning parameter that controls the degree of regularization, b1, b2,..., bp are the regression coefficients, and p is the number of predictors or features in the model.&lt;/p&gt;

&lt;p&gt;The purpose of the penalty term is to shrink the coefficients towards zero, which reduces the effect of less important features and helps to select a subset of the most important features. The Lasso penalty has the useful property of promoting sparsity in the coefficient estimates, favoring solutions in which many of the coefficients are exactly zero.&lt;/p&gt;

&lt;p&gt;Lasso regularization can be used for feature selection, as it tends to set the coefficients of less important features to zero. This is particularly useful when working with high-dimensional data sets where there may be many irrelevant features. Compared to Ridge regression, which uses a penalty term based on the square of the regression coefficients, Lasso regression is more likely to produce sparse models with fewer non-zero coefficients.&lt;/p&gt;

&lt;h2&gt;
  
  
  L1 Regularization:
&lt;/h2&gt;

&lt;p&gt;L1 regularization, also known as Lasso regularization, is a technique used in linear regression to prevent overfitting and improve the predictive performance of the model. It involves adding a penalty term to the ordinary least squares (OLS) cost function, which is a linear combination of the sum of squared errors and the sum of the absolute values of the regression coefficients.&lt;/p&gt;

&lt;p&gt;The L1 penalty term is given by the formula:&lt;/p&gt;

&lt;p&gt;lambda * (|b1| + |b2| + ... + |bp|)&lt;/p&gt;

&lt;p&gt;where lambda is a tuning parameter that controls the degree of regularization, b1, b2, ..., bp are the regression coefficients, and p is the number of predictors or features in the model.&lt;/p&gt;
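&lt;p&gt;Worked numerically, the penalty is just a scaled sum of absolute values. A minimal sketch with made-up coefficient values:&lt;/p&gt;

```python
import numpy as np

# Hypothetical coefficient vector and tuning parameter (illustrative values)
b = np.array([0.5, -2.0, 0.0, 3.0])
lam = 0.1

# lambda * (|b1| + |b2| + ... + |bp|) = 0.1 * (0.5 + 2.0 + 0.0 + 3.0)
penalty = lam * np.sum(np.abs(b))
print(penalty)
```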

&lt;h2&gt;
  
  
  Mathematical equation of Lasso Regression:
&lt;/h2&gt;

&lt;p&gt;The mathematical equation for Lasso regression can be written as:&lt;/p&gt;

&lt;p&gt;minimize (1/2m) * ||y - Xβ||^2 + λ * ||β||_1&lt;/p&gt;

&lt;p&gt;where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;y is the vector of response variables&lt;/li&gt;
&lt;li&gt;X is the matrix of predictor variables&lt;/li&gt;
&lt;li&gt;β is the vector of regression coefficients&lt;/li&gt;
&lt;li&gt;m is the number of observations&lt;/li&gt;
&lt;li&gt;λ is the regularization parameter, controlling the strength of the penalty term&lt;/li&gt;
&lt;li&gt;||β||_1 is the L1 norm of β, which is the sum of the absolute values of the coefficients&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first term in the equation is the ordinary least squares (OLS) cost function, which measures the difference between the predicted values of y and the actual values. The second term is the L1 penalty term, which is the sum of the absolute values of the regression coefficients multiplied by the regularization parameter λ.&lt;/p&gt;

&lt;p&gt;The objective of Lasso regression is to find the values of the regression coefficients that minimize the cost function while simultaneously keeping the size of the coefficients small. The L1 penalty term encourages sparsity in the coefficient estimates, meaning that it tends to set many of the coefficients to zero. This results in a simpler model with fewer non-zero coefficients, which is easier to interpret and less prone to overfitting.&lt;/p&gt;
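&lt;p&gt;Both terms of the objective can be evaluated directly. The sketch below uses a tiny made-up design matrix purely to show how the OLS term and the L1 penalty combine:&lt;/p&gt;

```python
import numpy as np

# Tiny illustrative problem: m = 2 observations, p = 2 features
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y = np.array([1.0, 2.0])
beta = np.array([1.0, 1.0])
lam = 0.5
m = len(y)

# (1/2m) * ||y - X*beta||^2  +  lambda * ||beta||_1
ols_term = np.sum((y - X @ beta) ** 2) / (2 * m)
l1_term = lam * np.sum(np.abs(beta))
cost = ols_term + l1_term

print(ols_term, l1_term, cost)  # 0.25 1.0 1.25
```

&lt;p&gt;The Lasso solver searches over β for the vector that minimizes this combined cost.&lt;/p&gt;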

&lt;h2&gt;
  
  
  Lasso Regression Example:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Import the Model Selection and Evaluation Tools:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, r2_score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Load the data:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.read_csv('data.csv')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Split the data into k subsets:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kf = KFold(n_splits=5, shuffle=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Create the lasso regression model:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = LassoCV(cv=kf, random_state=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fit the model:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = data.drop('response_variable', axis=1)
y = data['response_variable']
model.fit(X, y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Predict and evaluate the model:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y_pred = model.predict(X)
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f'MSE: {mse:.2f}, R-squared: {r2:.2f}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
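&lt;p&gt;Note that the score above is computed on the same rows the model was fit on, which tends to be optimistic. A held-out check is straightforward; the sketch below is self-contained, using a synthetic dataset as a stand-in for the hypothetical &lt;code&gt;data.csv&lt;/code&gt;:&lt;/p&gt;

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for the example's data.csv
X, y = make_regression(n_samples=300, n_features=10, n_informative=4,
                       noise=15.0, random_state=0)

# Hold out 25% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LassoCV(cv=5, random_state=0).fit(X_train, y_train)

# Score on rows the model has never seen
r2_val = r2_score(y_val, model.predict(X_val))
print(f"Validation R-squared: {r2_val:.2f}")
```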



&lt;h2&gt;
  
  
  Identify the subset of predictors:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;coef = pd.Series(model.coef_, index=X.columns)
selected_features = coef[coef != 0].index.tolist()
print(f'Selected features: {selected_features}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Output:
&lt;/h2&gt;

&lt;p&gt;R-squared: 0.7335508027883148&lt;/p&gt;

&lt;p&gt;The Lasso regression model attained an R-squared score of roughly 0.73 on the given dataset (measured on the training data).&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>lasso</category>
      <category>ai</category>
      <category>development</category>
    </item>
  </channel>
</rss>
