Predicting House Prices with Python: Data Cleaning, Modeling, and Feature Importance

Introduction

In this project, I tackled a classic machine learning problem: predicting house prices based on various property features. The journey involved real-world data cleaning, feature engineering, model building, and interpreting results using both standard regression metrics and ANOVA-based feature importance. Here’s a summary of my approach, key insights, and the skills I developed along the way.


Project Workflow

1. Data Cleaning

Real-world datasets are rarely perfect. My first step was to ensure the data was clean and consistent (a short pandas sketch follows the list):

  • Standardized column names by removing extra spaces, converting to lowercase, and replacing spaces with underscores.
  • Handled missing values by filling numeric columns with their mean values.
  • Standardized categorical values (like location, furnishing, and house_condition) by correcting typos and ensuring consistent capitalization.
  • Converted categorical variables to numeric using one-hot encoding.
  • Ensured all features were numeric and dropped or filled any remaining missing values.
  • Removed duplicate rows to avoid bias in modeling.
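
The exact cleaning code isn't reproduced here, but a minimal pandas sketch of these steps could look like the following (the file name, df, and the specific column names are illustrative assumptions, not the project's actual code):

import pandas as pd

# Illustrative cleaning pipeline; file and column names are placeholders
df = pd.read_csv("house_data.csv")

# Standardize column names: strip spaces, lowercase, replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Fill missing numeric values with each column's mean
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Standardize categorical text values (fix stray spaces and capitalization)
for col in ["location", "furnishing", "house_condition"]:
    df[col] = df[col].str.strip().str.title()

# One-hot encode categorical variables and remove duplicate rows
df = pd.get_dummies(df)
df = df.drop_duplicates()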

2. Feature Engineering

  • Derived new columns where useful (e.g., converting year built to house age; see the sketch below).
  • Prepared categorical features for modeling by encoding them numerically.
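
As a small illustration, deriving a house age column from the year built might look like this (the column names house_age and year_built are assumptions based on the dataset description):

import pandas as pd

# Derive house age from the year the house was built
current_year = pd.Timestamp.now().year
df["house_age"] = current_year - df["year_built"]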

3. Model Building

  • Model Used: Linear Regression from scikit-learn (a condensed sketch follows the metrics list).
  • Training/Test Split: 80% of the data was used for training, 20% for testing.
  • Evaluation Metrics:
    • Mean Squared Error (MSE)
    • Root Mean Squared Error (RMSE)
    • Mean Absolute Error (MAE)
    • R² Score
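
A condensed version of this setup, assuming the cleaned DataFrame df from the previous steps and a price target column (the column name is assumed for illustration), would be:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Separate features and target (target column name assumed for illustration)
X = df.drop(columns=["price"])
y = df["price"]

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a linear regression model and predict on the held-out set
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate with MSE, RMSE, MAE, and R²
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2e}, RMSE: {rmse:.2e}, MAE: {mae:.2e}, R²: {r2:.2f}")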

Results:

  • MSE: 7.80e-22 (almost zero)
  • RMSE: 2.79e-11 (almost zero)
  • MAE: 1.74e-11 (almost zero)
  • R² Score: 1.0 (perfect fit)

Note: Such near-perfect results are rare in real-world scenarios and usually point to a very simple dataset or to data leakage. In this case, the ANOVA results below list a converted price column among the predictors, which suggests the target may have leaked into the feature set. Always double-check your data pipeline!

4. Feature Importance with ANOVA

To understand which features most influence house prices, I used ANOVA (Analysis of Variance) via f_regression from scikit-learn. This provided F-values and p-values for each feature, highlighting their statistical significance.
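
Under the hood this amounts to calling f_regression on the feature matrix and collecting the scores into a DataFrame, which is what anova_results in the visualization code below refers to. A sketch, assuming the X and y from the modeling step:

import pandas as pd
from sklearn.feature_selection import f_regression

# Compute an F-value and p-value for each feature against the target
f_values, p_values = f_regression(X, y)

anova_results = pd.DataFrame({
    "Feature": X.columns,
    "F_value": f_values,
    "p_value": p_values,
}).sort_values("F_value", ascending=False)

print(anova_results.head(10))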

Key Insights from ANOVA:

  • Most Important Predictors:
    • Converted_datatype_for_price($), Size_sqft, Converted_datatype_for_size_sqft
    • House_condition_New, House_condition_Old
    • Has_pool, Year_built
  • Moderately Important:
    • Furnishing_Semi-Furnished, Lot_size
  • Not Significant:
    • Bath_rooms, Garage_available, Location_Urban, and others

Visualization

I visualized the F-values and p-values using a heatmap to quickly identify the most influential features:

import matplotlib.pyplot as plt
import seaborn as sns

# anova_results is the DataFrame of F-values and p-values built in the ANOVA step
heatmap_data = anova_results.set_index('Feature')[['F_value', 'p_value']]
plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, cmap='YlGnBu', fmt=".2e")
plt.title('ANOVA F-value and p-value Heatmap for Features')
plt.show()

Skills and Experience Gained

  • Data Cleaning: Learned to handle missing values, standardize data, and ensure consistency.
  • Feature Engineering: Gained experience in transforming and encoding features for machine learning.
  • Model Evaluation: Used multiple regression metrics to assess model performance.
  • Statistical Analysis: Applied ANOVA to interpret feature importance and guide model refinement.
  • Visualization: Created clear plots to communicate results and insights.
  • Critical Thinking: Recognized the importance of checking for data leakage and overfitting.

Conclusion

This project was a comprehensive exercise in the end-to-end machine learning workflow, from raw data to actionable insights. The experience reinforced the importance of data preparation, careful model evaluation, and statistical interpretation in building robust predictive models.

**Thanks for reading!** If you have questions or want to discuss more about data science and machine learning, feel free to reach out in the comments.
