Introduction
In this project, I tackled a classic machine learning problem: predicting house prices based on various property features. The journey involved real-world data cleaning, feature engineering, model building, and interpreting results using both standard regression metrics and ANOVA-based feature importance. Here’s a summary of my approach, key insights, and the skills I developed along the way.
Project Workflow
1. Data Cleaning
Real-world datasets are rarely perfect. My first step was to ensure the data was clean and consistent:
- Standardized column names by removing extra spaces, converting to lowercase, and replacing spaces with underscores.
- Handled missing values by filling numeric columns with their mean values.
- Standardized categorical values (like location, furnishing, and house_condition) by correcting typos and ensuring consistent capitalization.
- Converted categorical variables to numeric using one-hot encoding.
- Ensured all features were numeric and dropped or filled any remaining missing values.
- Removed duplicate rows to avoid bias in modeling.
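The cleaning steps above can be sketched in pandas roughly as follows. The column names and values here are illustrative stand-ins, not the actual dataset's:

```python
import pandas as pd

# Toy stand-in for the raw housing data
df = pd.DataFrame({
    ' Price($) ': [250000, 310000, None, 250000],
    'House Condition': ['new', 'OLD', 'New', 'new'],
})

# Standardize column names: strip spaces, lowercase, underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Fill missing numeric values with the column mean
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Standardize categorical values to consistent capitalization
df['house_condition'] = df['house_condition'].str.strip().str.title()

# One-hot encode categoricals, then drop duplicate rows
df = pd.get_dummies(df, columns=['house_condition'], drop_first=True)
df = df.drop_duplicates()
```

Doing the deduplication after encoding ensures that rows differing only in a typo'd category (already normalized) are caught as true duplicates.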
2. Feature Engineering
- Derived new columns where useful (e.g., converting year built to house age).
- Prepared categorical features for modeling by encoding them numerically.
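As a small illustration of the year-built-to-age derivation (the column names and reference year are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({'year_built': [1990, 2005, 2020]})

# Derive house age relative to a fixed reference year
REFERENCE_YEAR = 2024
df['house_age'] = REFERENCE_YEAR - df['year_built']
```

A derived age is often more directly interpretable to a linear model than the raw year.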
3. Model Building
- Model Used: Linear Regression from scikit-learn.
- Training/Test Split: 80% of the data was used for training, 20% for testing.
- Evaluation Metrics:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- R² Score
Results:
- MSE: 7.80e-22 (almost zero)
- RMSE: 2.79e-11 (almost zero)
- MAE: 1.74e-11 (almost zero)
- R² Score: 1.0 (perfect fit)
Note: Such perfect results are rare in real-world scenarios and may indicate a very simple dataset or potential data leakage. Always double-check your data pipeline!
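The split, fit, and metric computation follow the standard scikit-learn pattern. This sketch uses synthetic features and prices since the original dataset isn't shown here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned feature matrix and price target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([120.0, 80.0, 15.0]) + rng.normal(scale=5.0, size=200)

# 80/20 train/test split, as in the project
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
```

On real data, an R² of exactly 1.0 usually means the target (or a transform of it) leaked into the features, so it is worth auditing the feature list before celebrating.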
4. Feature Importance with ANOVA
To understand which features most influence house prices, I used ANOVA (Analysis of Variance) via f_regression
from scikit-learn. This provided F-values and p-values for each feature, highlighting their statistical significance.
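A minimal sketch of how f_regression yields per-feature F-values and p-values, using synthetic data with one informative feature and one pure-noise feature (names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Synthetic stand-in: one informative feature, one noise feature
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'size_sqft': rng.normal(size=300),
    'noise': rng.normal(size=300),
})
y = 3.0 * X['size_sqft'] + rng.normal(size=300)

# Univariate F-test of each feature against the target
f_values, p_values = f_regression(X, y)
anova_results = pd.DataFrame({
    'Feature': X.columns,
    'F_value': f_values,
    'p_value': p_values,
})
```

A large F-value with a small p-value marks a feature whose linear relationship with price is unlikely to be chance; the noise column should score near zero.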
Key Insights from ANOVA:
- Most Important Predictors: Converted_datatype_for_price($), Size_sqft, Converted_datatype_for_size_sqft, House_condition_New, House_condition_Old, Has_pool, Year_built
- Moderately Important: Furnishing_Semi-Furnished, Lot_size
- Not Significant: Bath_rooms, Garage_available, Location_Urban, and others
Visualization
I visualized the F-values and p-values using a heatmap to quickly identify the most influential features:
```python
import matplotlib.pyplot as plt
import seaborn as sns

# anova_results holds one row per feature with its F-value and p-value
heatmap_data = anova_results.set_index('Feature')[['F_value', 'p_value']]

plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, cmap='YlGnBu', fmt=".2e")
plt.title('ANOVA F-value and p-value Heatmap for Features')
plt.show()
```
Skills and Experience Gained
- Data Cleaning: Learned to handle missing values, standardize data, and ensure consistency.
- Feature Engineering: Gained experience in transforming and encoding features for machine learning.
- Model Evaluation: Used multiple regression metrics to assess model performance.
- Statistical Analysis: Applied ANOVA to interpret feature importance and guide model refinement.
- Visualization: Created clear plots to communicate results and insights.
- Critical Thinking: Recognized the importance of checking for data leakage and overfitting.
Conclusion
This project was a comprehensive exercise in the end-to-end machine learning workflow, from raw data to actionable insights. The experience reinforced the importance of data preparation, careful model evaluation, and statistical interpretation in building robust predictive models.
**Thanks for reading! If you have questions or want to discuss more about data science and machine learning, feel free to reach out in the comments.**