Tamal Barman

Navigating the Housing Market Storm: A Data-Driven Approach

Introduction

In the vast landscape of the real estate market, understanding the dynamics that influence housing prices is crucial. In this blog post, we embark on a data-driven journey to explore the intricacies of housing price prediction, using advanced regression techniques and ensemble learning. The dataset under scrutiny is the well-known "House Prices: Advanced Regression Techniques" dataset from Kaggle.

Explanation of Random Forest

Random forests are powerful predictive models that allow for data-driven exploration of many explanatory variables when predicting a response (target) variable. They provide an importance score for each explanatory variable and let us track how predictive performance changes as the number of trees grows. The final pipeline below stacks several regressors (including gradient-boosted trees) rather than fitting a single random forest, but the same ideas carry over.
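
Before turning to the full pipeline, here is a minimal, self-contained sketch (not from the original notebook; it assumes the Kaggle train.csv file is available locally) that illustrates both ideas on this dataset: an importance score per explanatory variable, and cross-validated error as the number of trees varies.

# Illustrative only: a random forest on the numeric features of train.csv
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train.csv", index_col="Id")
X = df.select_dtypes(np.number).drop(columns="SalePrice").fillna(0)
y = np.log(df["SalePrice"])  # log prices, matching the pipeline below

for n_trees in (10, 100, 500):  # performance as the forest grows
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0, n_jobs=-1)
    score = cross_val_score(rf, X, y, cv=5, scoring="neg_root_mean_squared_error").mean()
    print(n_trees, "trees -> CV RMSE:", -score)

rf = RandomForestRegressor(n_estimators=500, random_state=0, n_jobs=-1).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))  # importance score per explanatory variable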

Data Preprocessing

Before diving into the Random Forest analysis, it's essential to preprocess the data. This includes handling missing values, encoding categorical variables, and scaling features to ensure the model performs optimally.

Importing Modules and Data

We kick off by importing essential libraries such as NumPy, Pandas, and Scikit-Learn. The dataset, split into training and testing sets, is loaded into our analysis environment.

# Importing Modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import scipy
# ... (other module imports)

# Importing training and testing data
train_data = pd.read_csv("/content/train.csv", index_col="Id")
test_data = pd.read_csv("/content/test.csv", index_col="Id")

Explanation of Response and Explanatory Variables

In our analysis, the response variable (dependent variable) is the sale price of the houses, while the explanatory variables (independent variables) include various features such as the size of the house, number of bedrooms, location, etc. These variables were chosen based on their relevance to predicting housing prices.
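
Since the response variable is a price, it is worth looking at its distribution before modelling. The sketch below (not from the original post) plots the raw and log-transformed sale prices; the strong right skew of raw prices is why the pipeline works with log prices and converts predictions back with np.exp at the very end.

# Illustrative: distribution of the target before and after a log transform
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(train_data["SalePrice"], bins=40, color="#3f72af")
axes[0].set_title("SalePrice")
axes[1].hist(np.log(train_data["SalePrice"]), bins=40, color="#3f72af")
axes[1].set_title("log(SalePrice)")
plt.show()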

Data Visualization

Scatter Plot to Check for Outliers
Our exploration begins with a scatter plot of ground living area against sale price. It makes the handful of extreme points easy to spot and sets the stage for data cleaning; the dashed vertical line marks the region where those outliers sit.

fig, ax = plt.subplots(figsize=(10, 6))
ax.grid()
ax.scatter(train_data["GrLivArea"], train_data["SalePrice"], c="#3f72af", zorder=3, alpha=0.9)
ax.axvline(4500, c="#112d4e", ls="--", zorder=2)
ax.set_xlabel("Ground living area (sq. ft)", labelpad=10)
ax.set_ylabel("Sale price ($)", labelpad=10)


Data Cleaning

Removing outliers is a pivotal step in refining the dataset. In this case, we exclude instances where the ground living area exceeds 4450 sq. ft.

train_data = train_data[train_data["GrLivArea"] < 4450]  # drop the extreme ground-living-area outliers
data = pd.concat([train_data.drop("SalePrice", axis=1), test_data])  # combine train and test features for consistent preprocessing

Bar Graph to Check Missing Values

Understanding the prevalence of missing values guides our imputation strategy. A bar graph illustrates the number of missing values for each feature.

nans = data.isna().sum().sort_values(ascending=False)
nans = nans[nans > 0]
fig, ax = plt.subplots(figsize=(10, 6))
ax.grid()
ax.bar(nans.index, nans.values, zorder=2, color="#3f72af")
ax.set_ylabel("No. of missing values", labelpad=10)
ax.set_xlim(-0.6, len(nans) - 0.4)
ax.xaxis.set_tick_params(rotation=90)
plt.show()
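
The post does not show the imputation step itself. One reasonable approach for this dataset (an assumption, not necessarily the author's exact code) is to treat most categorical NaNs as "feature absent", fill LotFrontage with the neighbourhood median, and set any remaining numeric gaps to zero:

# Hypothetical imputation sketch, applied to the combined train/test frame
for col in data.columns:
    if data[col].dtype == "O":
        data[col] = data[col].fillna("None")  # NaN usually means the feature is absent (e.g. no garage)
data["LotFrontage"] = data.groupby("Neighborhood")["LotFrontage"].transform(lambda s: s.fillna(s.median()))
data = data.fillna(0)  # remaining numeric gaps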

Exploring Numerical Variables

We delve into the analysis of numerical features, distinguishing between discrete and continuous variables.
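
The continuous-variable loop below refers to numerical_features and discrete_variables, which are not shown in the excerpt. A common way to build them (a heuristic assumption, not necessarily the author's exact rule) is to take every non-object column except the target as numerical, and to call a numerical feature discrete when it has only a few distinct values:

# Hypothetical helper lists used by the loops below
numerical_features = [f for f in train_data.columns
                      if train_data[f].dtype != "O" and f != "SalePrice"]
discrete_variables = [f for f in numerical_features
                      if train_data[f].nunique() < 25
                      and f not in ["YearBuilt", "YearRemodAdd", "GarageYrBlt", "YrSold"]]
print(len(numerical_features), "numerical features,", len(discrete_variables), "discrete")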

Discrete Values

# Discrete Values
# ... (code for visualizing discrete variables)

Continuous Values

# Continuous Values
continuous_variables = []
for feature in numerical_features:
    if feature not in discrete_variables and feature not in ["YearBuilt", "YearRemodAdd", "GarageYrBlt", "YrSold"]:
        continuous_variables.append(feature)

print(continuous_variables)

# Histogram of each continuous variable in the training data
for feature in continuous_variables:
    train_data[feature].hist(bins=30)
    plt.title(feature)
    plt.show()

Categorical Variables

# Categorical Values
categorical_features = []
for feature in train_data.columns:
    if train_data[feature].dtype == 'O' and feature != 'SalePrice':
        categorical_features.append(feature)
print(categorical_features)

for feature in categorical_features:
    train_data.groupby(feature)['SalePrice'].mean().plot.bar()
    plt.title(feature + ' vs Sale Price')
    plt.show()


Data Transformation & Feature Scaling

Data Transformation

# Data Transformation
data[["MSSubClass", "YrSold"]] = data[["MSSubClass", "YrSold"]].astype("category")  # converting into categorical value
data["MoSoldsin"] = np.sin(2 * np.pi * data["MoSold"] / 12)  # Sine Function
data["MoSoldcos"] = np.cos(2 * np.pi * data["MoSold"] / 12)  # Cosine Function
data = data.drop("MoSold", axis=1)

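The sine/cosine pair encodes the month of sale as a point on a circle, so December and January end up close together instead of 11 units apart. A quick check (illustrative, not from the original notebook):

# Months 1 and 12 land on neighbouring points of the unit circle
import numpy as np

for month in (1, 6, 12):
    print(month, round(np.sin(2 * np.pi * month / 12), 3), round(np.cos(2 * np.pi * month / 12), 3))
# 1  ->  0.5   0.866
# 6  ->  0.0  -1.0
# 12 -> -0.0   1.0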

Feature Scaling

# Feature Scaling
cols = data.select_dtypes(np.number).columns
data[cols] = RobustScaler().fit_transform(data[cols])

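RobustScaler centres each numeric column on its median and divides by the interquartile range, so it is far less sensitive to extreme values than mean/variance scaling. A small check of that behaviour (illustrative):

# Manual median/IQR scaling matches RobustScaler's output
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one extreme value
scaled = RobustScaler().fit_transform(x)
manual = (x - np.median(x)) / (np.percentile(x, 75) - np.percentile(x, 25))
print(np.allclose(scaled, manual))  # True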

Encoding

data = pd.get_dummies(data)  # one-hot encode the categorical variables

Feature Recovery & Target Preparation

X_train = data.loc[train_data.index]  # recover the (cleaned) training rows
X_test = data.loc[test_data.index]  # recover the test rows
y = np.log(train_data["SalePrice"])  # log-transform the target; predictions are converted back with np.exp later

Optimization, Training, and Testing

Hyperparameter Optimization

# Hyper Parameter Optimization
kf = KFold(n_splits=5, random_state=0, shuffle=True)
rmse = lambda y, y_pred: np.sqrt(mean_squared_error(y, y_pred))
scorer = make_scorer(rmse, greater_is_better=False)

# We use randomized search for optimization, since it is more efficient than an exhaustive grid search.
# Define a helper that takes a model and a parameter grid, runs RandomizedSearchCV with the RMSE scorer,
# and returns the fitted search object. (Early stopping for XGBoost would need an explicit eval_set,
# so every model goes through the same search here.)
def random_search(model, grid, n_iter=100):
    search = RandomizedSearchCV(estimator=model, param_distributions=grid, cv=kf, scoring=scorer,
                                n_iter=n_iter, n_jobs=4, random_state=0, verbose=True)
    return search.fit(X_train, y)

# Hyperparameter Grids
xgb_hpg = {'n_estimators': [100, 400, 800], 'max_depth': [3, 6, 9], 'learning_rate': [0.05, 0.1, 0.20], 'min_child_weight': [1, 10, 100]}  # XGBoost
ridge_hpg = {"alpha": np.logspace(-1, 2, 500)}  # Ridge Regressor
lasso_hpg = {"alpha": np.logspace(-5, -1, 500)}  # Lasso Regressor
svr_hpg = {"C": np.arange(1, 100), "gamma": np.linspace(0.00001, 0.001, 50), "epsilon": np.linspace(0.01, 0.1, 50)}  # Support Vector Regressor
lgbm_hpg = {"colsample_bytree": np.linspace(0.2, 0.7, 6), "learning_rate": np.logspace(-3, -1, 100)}  # LGBM
gbm_hpg = {"max_features": np.linspace(0.2, 0.7, 6), "learning_rate": np.logspace(-3, -1, 100)}  # Gradient Boost
cat_hpg = {'depth': [2, 9], 'iterations': [10, 30], 'learning_rate': [0.001, 0.1]}

# Randomized Search for each model
xgb_search = random_search(xgb.XGBRegressor(n_estimators=1000, n_jobs=4), xgb_hpg)  # XGBoost
ridge_search = random_search(Ridge(), ridge_hpg)  # Ridge Regressor
lasso_search = random_search(Lasso(), lasso_hpg)  # Lasso Regressor
svr_search = random_search(SVR(), svr_hpg, n_iter=100)  # Support Vector Regressor
lgbm_search = random_search(LGBMRegressor(n_estimators=2000, max_depth=3), lgbm_hpg, n_iter=100)  # LGBM
gbm_search = random_search(GradientBoostingRegressor(n_estimators=2000, max_depth=3), gbm_hpg, n_iter=100)  # Gradient Boost

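Each fitted RandomizedSearchCV object exposes the winning hyperparameters and the best cross-validated score. Assuming the searches were run with the RMSE scorer defined above (which has greater_is_better=False), best_score_ stores the negated RMSE, so flipping the sign recovers the cross-validated RMSE on log prices:

# Inspect the tuned base models
for name, search in [("XGBoost", xgb_search), ("Ridge", ridge_search), ("Lasso", lasso_search),
                     ("SVR", svr_search), ("LightGBM", lgbm_search), ("GradientBoosting", gbm_search)]:
    print(f"{name}: CV RMSE = {-search.best_score_:.4f}, params = {search.best_params_}")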

Ensemble Learning

With each base model tuned by randomized search, we stack their best estimators and let a Ridge meta-regressor learn how to combine their predictions. Performance is judged with the same cross-validation folds on the training data, since the Kaggle test set has no labels; the final step is to predict on that test set and submit.

Ensemble Learning Model

# Ensemble Learning Model
models = [search.best_estimator_ for search in [xgb_search, ridge_search, lasso_search, svr_search, lgbm_search, gbm_search]]  # list of best estimators from each model
ensemble_search = random_search(StackingCVRegressor(models, Ridge(), cv=kf), {"meta_regressor__alpha": np.logspace(-3, -2, 500)}, n_iter=20)  # Ensemble Stack
models.append(ensemble_search.best_estimator_)  # list of best estimators from each model including Stack


Predicting Values & Submission

# Predicting Values & Submission
prediction = [model.predict(X_test) for model in models]  # one prediction array per model
predictions = np.average(prediction, axis=0)  # average the models' predictions

# The models predict log prices, so convert back with the exponential function
# and write the submission in the format Kaggle expects.
my_prediction = pd.DataFrame({"Id": test_data.index, "SalePrice": np.exp(predictions)})
my_prediction.to_csv(r"E:\Education\Kaggle Projects\House Price - Advanced Regression/my_prediction_ensemble.csv", index=False)  # Saving to CSV


Output Interpretation

The cross-validated RMSE (computed on log-transformed sale prices) indicates how well each tuned model, and the stacked ensemble, predicts housing prices from the given features. For the tree-based models we can also inspect the importance score assigned to each explanatory variable to understand its impact on the prediction.
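
As one concrete example (a sketch, not code from the original post), the tuned XGBoost model exposes an importance score for every one-hot-encoded feature:

# Feature importances from the tuned XGBoost model
best_xgb = xgb_search.best_estimator_
importances = pd.Series(best_xgb.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances.head(15))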

Conclusion

Our data-driven exploration and modelling journey provides valuable insights into predicting housing prices. By leveraging advanced regression techniques and ensemble learning, we navigate through challenges, optimize models, and make predictions that contribute to the dynamic landscape of the housing market.

Click here to view and run the code in Google Colab.

Stay tuned for more data-driven adventures!
