Predicting House Prices with Scikit-learn: A Complete Guide

Amit Chandra

Machine learning is transforming many industries, including real estate. One common task is predicting house prices from features such as the number of bedrooms, bathrooms, square footage, and location. In this article, we will build a machine learning model with scikit-learn to predict house prices, covering every step from data preprocessing to model deployment.

Table of Contents

  1. Introduction to Scikit-learn
  2. Problem Definition
  3. Data Collection
  4. Data Preprocessing
  5. Feature Selection
  6. Model Training
  7. Model Evaluation
  8. Model Tuning (Hyperparameter Optimization)
  9. Model Deployment
  10. Conclusion

1. Introduction to Scikit-learn

Scikit-learn is one of the most widely used libraries for machine learning in Python. It offers simple and efficient tools for data analysis and modeling. Whether you’re dealing with classification, regression, clustering, or dimensionality reduction, scikit-learn provides an extensive set of utilities to help you build robust machine learning models.

In this guide, we’ll build a regression model using scikit-learn to predict house prices. Let’s walk through each step of the process.


2. Problem Definition

The task at hand is to predict the price of a house from features such as:

  • Number of bedrooms
  • Number of bathrooms
  • Area (in square feet)
  • Location

This is a supervised learning problem where the target variable (house price) is continuous, making it a regression task. Scikit-learn provides a variety of algorithms for regression, such as Linear Regression and Random Forest, which we will use in this project.


3. Data Collection

You can either use a real-world dataset like the Kaggle House Prices dataset or gather your own data from a public API.

Here’s a sample of how your data might look:

Bedrooms | Bathrooms | Area (sq.ft) | Location | Price ($)
3        | 2         | 1,500        | Boston   | 300,000
4        | 3         | 2,000        | Seattle  | 500,000

The target variable here is the Price.
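
If your data lives in a CSV file, loading it with pandas might look like the following minimal sketch (the filename house_prices.csv and the column names are assumptions matching the sample above):

import pandas as pd

# Load the raw dataset; the filename is a placeholder
data = pd.read_csv('house_prices.csv')

# Separate the inputs (X) from the target we want to predict (y)
X = data.drop(columns=['Price'])
y = data['Price']

print(data.head())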


4. Data Preprocessing

Before feeding the data into a machine learning model, we need to preprocess it. This includes handling missing values, encoding categorical features, and scaling the data.

Handling Missing Data

Missing data is common in real-world datasets. We can either fill missing values with a statistical measure like the median or drop rows with missing data:

# Fill missing numeric values with the column median
data.fillna(data.median(numeric_only=True), inplace=True)

Encoding Categorical Features

Since machine learning models require numerical input, we need to convert categorical features like Location into numbers. Label Encoding assigns a unique integer to each category (note that this imposes an arbitrary ordering on the locations; an alternative is shown after the snippet below):

from sklearn.preprocessing import LabelEncoder

# Replace each location name with a unique integer code
encoder = LabelEncoder()
data['Location'] = encoder.fit_transform(data['Location'])
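
Because label encoding imposes an arbitrary numeric order on the locations, one-hot encoding is often preferred for nominal features. A quick alternative sketch with pandas:

import pandas as pd

# Alternative: create one binary column per location, with no implied ordering
data = pd.get_dummies(data, columns=['Location'])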

Feature Scaling

It’s important to scale numerical features like Area so that all inputs are on a comparable scale, especially for algorithms that are sensitive to feature magnitude. (To avoid data leakage in a production pipeline, you would fit the scaler on the training split only.) Here’s how we apply scaling to the feature matrix:

from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

5. Feature Selection

Not all features contribute equally to the target variable. Feature selection helps in identifying the most important features, which improves model performance and reduces overfitting.

In this project, we use SelectKBest with the f_regression score to keep the features most strongly related to the target variable. Note that k must not exceed the number of available feature columns; with a richer dataset such as the Kaggle one, k=5 is a reasonable starting point:

from sklearn.feature_selection import SelectKBest, f_regression

# Keep the k features with the strongest linear relationship to the target
selector = SelectKBest(score_func=f_regression, k=5)
X_new = selector.fit_transform(X, y)
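
To see which columns survived the selection (a small sketch, assuming X is a pandas DataFrame):

# Names of the k columns kept by SelectKBest
selected_columns = X.columns[selector.get_support()]
print(list(selected_columns))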

6. Model Training

Now that we have preprocessed the data and selected the most informative features, it’s time to train the models. First we split the data into training and test sets, then we fit two regression algorithms: Linear Regression and Random Forest.

Train-Test Split

To evaluate how well our models generalize, we hold out 20% of the data as a test set before any model sees it:

from sklearn.model_selection import train_test_split

# 80% of the rows for training, 20% held out for evaluation
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=42)

Linear Regression

Linear regression fits a linear function of the features that minimizes the squared difference between the predicted and actual prices:

from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

Random Forest

Random Forest is an ensemble method that trains many decision trees on random subsets of the data and averages their predictions, which improves accuracy and reduces overfitting:

from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
forest_model.fit(X_train, y_train)
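
As a quick sanity check, you can also inspect which inputs the Random Forest relies on most (a sketch reusing the selected_columns from the feature-selection step):

import pandas as pd

# Pair each selected feature with the importance score learned by the forest
importances = pd.Series(forest_model.feature_importances_, index=selected_columns)
print(importances.sort_values(ascending=False))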

7. Model Evaluation

After training the models, we need to evaluate their performance using metrics like Mean Squared Error (MSE) and R-squared (R²).
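
Before computing the metrics, we need predictions on the held-out test set; here we use the Random Forest model (swap in linear_model to evaluate the linear baseline):

# Predict prices for the unseen test rows
y_pred = forest_model.predict(X_test)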

Mean Squared Error (MSE)

MSE calculates the average squared difference between the predicted and actual values. A lower MSE indicates better performance:

from sklearn.metrics import mean_squared_error

# Average squared difference between predicted and actual prices
mse = mean_squared_error(y_test, y_pred)

R-squared (R²)

R² tells us how well the model explains the variance in the target variable. A value of 1 means the model explains all of the variance in the test set; values near 0 mean it explains almost none:

from sklearn.metrics import r2_score

# Proportion of the variance in house prices explained by the model
r2 = r2_score(y_test, y_pred)

Compare the performance of the Linear Regression and Random Forest models using these metrics.
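
A minimal comparison sketch, assuming both models were fit on X_train as above:

from sklearn.metrics import mean_squared_error, r2_score

# Evaluate both models on the same held-out test set
for name, model in [('Linear Regression', linear_model), ('Random Forest', forest_model)]:
    preds = model.predict(X_test)
    print(f"{name}: MSE = {mean_squared_error(y_test, preds):,.0f}, R² = {r2_score(y_test, preds):.3f}")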


8. Model Tuning (Hyperparameter Optimization)

To further improve model performance, we can fine-tune the hyperparameters. For Random Forest, hyperparameters like n_estimators (number of trees) and max_depth (maximum depth of trees) can significantly impact performance.

Here’s how to use GridSearchCV for hyperparameter optimization:

from sklearn.model_selection import GridSearchCV

# Candidate values for two influential Random Forest hyperparameters
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20]
}

# 5-fold cross-validation over every combination in the grid
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
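
Once the search has finished, you can inspect the winning combination and its cross-validated score (a quick sketch):

# Best hyperparameter combination and its mean cross-validation score (R² by default for regressors)
print(grid_search.best_params_)
print(grid_search.best_score_)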

9. Model Deployment

Once you’ve trained and tuned the model, the next step is deployment. You can use Flask to create a simple web application that serves predictions.

Here’s a basic Flask app that serves house price predictions (for brevity, it assumes the incoming features are already encoded and scaled the same way as the training data):

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load the trained model saved with joblib (see below)
model = joblib.load('best_model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    # Expect a JSON body like {"features": [3, 2, 1500, 1]}
    prediction = model.predict([data['features']])
    # Cast the NumPy value to a plain float so it serializes to JSON
    return jsonify({'predicted_price': float(prediction[0])})

if __name__ == '__main__':
    app.run()

Before the app can load it, save the tuned model to disk using joblib:

import joblib
joblib.dump(best_model, 'best_model.pkl')

This way, you can make predictions by sending requests to the API.
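
For example, with the app running locally on Flask’s default port, a client request might look like this (a sketch; the example feature vector [3, 2, 1500, 1] is hypothetical and must match the order and preprocessing of the columns the model was trained on):

import requests

# Send one row of features to the /predict endpoint and print the response
response = requests.post(
    'http://127.0.0.1:5000/predict',
    json={'features': [3, 2, 1500, 1]}
)
print(response.json())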


10. Conclusion

In this project, we explored the entire process of building a machine learning model using scikit-learn to predict house prices. From data preprocessing and feature selection to model training, evaluation, and deployment, each step was covered with practical code examples.

Whether you’re new to machine learning or looking to apply scikit-learn in real-world projects, this guide provides a comprehensive workflow that you can adapt for various regression tasks.

Feel free to experiment with different models, datasets, and techniques to enhance the performance and accuracy of your model.

#Regression #AI #DataAnalysis #DataPreprocessing #MLModel #RandomForest #LinearRegression #Flask #APIDevelopment #RealEstate #TechBlog #Tutorial #DataEngineering #DeepLearning #PredictiveAnalytics #DevCommunity

Top comments (6)

Alexander Grumic

This is the first post of yours that I've read... Great work! 👍 I assume this is the first of many. Keep up the good work, and thanks for sharing!

Amit Chandra

Thank you so much for the kind words! 😊 I'm really glad you enjoyed the article. This is indeed the first of many, and I'm excited to share more in the future. Your support means a lot—thanks for reading and engaging! 🙌

Vortico

Hey, great post! We really enjoyed it. You might be interested in knowing how to productionalise ML models with a simple line of code. If so, please have a look at flama for Python. Some time ago we published a post Introducing Flama for Robust ML APIs. We think you might really enjoy the post, and flama.
If you have any doubts, or you'd like to learn more about it and how it works in more detail, don't hesitate to give us a shout. And if you like it, please gift us a star ⭐ here.

Amit Chandra

Thank you for the feedback and for introducing me to Flama! 😊 I’m always eager to learn more about tools that simplify the process of productionalizing ML models, so I’ll definitely check out your post and Flama. It sounds like a great resource for building robust ML APIs.

I appreciate the recommendation—I'll give it a read, and if I find it useful, you can expect a star from me! ⭐

migduroli

I recommend having a look at flama, an open-source project designed specifically for the productionalization of ML models via ML APIs. For an actual example of an entire ML pipeline run with flama, you can check this post, which I think contains all the relevant information.

Amit Chandra

Sure, I'll check it out.