Introduction
In this project, I tackled a classic machine learning problem: predicting house prices based on various property features. The journey involved real-world data cleaning, feature engineering, model building, and interpreting results using both standard regression metrics and ANOVA-based feature importance. Here’s a summary of my approach, key insights, and the skills I developed along the way.
Project Workflow
1. Data Cleaning
Real-world datasets are rarely perfect. My first step was to ensure the data was clean and consistent:
- Standardized column names by removing extra spaces, converting to lowercase, and replacing spaces with underscores.
- Handled missing values by filling numeric columns with their mean values.
- Standardized categorical values (like location, furnishing, and house_condition) by correcting typos and ensuring consistent capitalization.
- Converted categorical variables to numeric using one-hot encoding.
- Ensured all features were numeric and dropped or filled any remaining missing values.
- Removed duplicate rows to avoid bias in modeling.
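The cleaning steps above can be sketched in pandas roughly as follows. The column names and values here are illustrative stand-ins, not the actual dataset's:

```python
import pandas as pd

# Toy stand-in for the raw housing data
df = pd.DataFrame({
    ' Price($) ': [250000, 310000, None, 250000],
    'House Condition': ['new', 'OLD', 'New', 'new'],
})

# Standardize column names: strip spaces, lowercase, underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Fill missing numeric values with the column mean
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Standardize categorical values to consistent capitalization
df['house_condition'] = df['house_condition'].str.strip().str.title()

# One-hot encode categoricals, then drop duplicate rows
df = pd.get_dummies(df, columns=['house_condition'], drop_first=True)
df = df.drop_duplicates()
```

Doing the deduplication after encoding ensures that rows differing only in a typo'd category (already normalized) are caught as true duplicates.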
2. Feature Engineering
- Derived new columns where useful (e.g., converting year built to house age).
- Prepared categorical features for modeling by encoding them numerically.
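As a small illustration of the year-built-to-age derivation (the column names and reference year are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({'year_built': [1990, 2005, 2020]})

# Derive house age relative to a fixed reference year
REFERENCE_YEAR = 2024
df['house_age'] = REFERENCE_YEAR - df['year_built']
```

A derived age is often more directly interpretable to a linear model than the raw year.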
3. Model Building
- Model Used: Linear Regression from scikit-learn.
- Training/Test Split: 80% of the data was used for training, 20% for testing.
- Evaluation Metrics:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- R² Score
Results:
- MSE: 7.80e-22 (almost zero)
- RMSE: 2.79e-11 (almost zero)
- MAE: 1.74e-11 (almost zero)
- R² Score: 1.0 (perfect fit)
Note: Such perfect results are rare in real-world scenarios and may indicate a very simple dataset or potential data leakage. Always double-check your data pipeline!
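The split, fit, and metric computation follow the standard scikit-learn pattern. This sketch uses synthetic features and prices since the original dataset isn't shown here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned feature matrix and price target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([120.0, 80.0, 15.0]) + rng.normal(scale=5.0, size=200)

# 80/20 train/test split, as in the project
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
```

On real data, an R² of exactly 1.0 usually means the target (or a transform of it) leaked into the features, so it is worth auditing the feature list before celebrating.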
4. Feature Importance with ANOVA
To understand which features most influence house prices, I used ANOVA (Analysis of Variance) via f_regression
from scikit-learn. This provided F-values and p-values for each feature, highlighting their statistical significance.
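A minimal sketch of how f_regression yields per-feature F-values and p-values, using synthetic data with one informative feature and one pure-noise feature (names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Synthetic stand-in: one informative feature, one noise feature
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'size_sqft': rng.normal(size=300),
    'noise': rng.normal(size=300),
})
y = 3.0 * X['size_sqft'] + rng.normal(size=300)

# Univariate F-test of each feature against the target
f_values, p_values = f_regression(X, y)
anova_results = pd.DataFrame({
    'Feature': X.columns,
    'F_value': f_values,
    'p_value': p_values,
})
```

A large F-value with a small p-value marks a feature whose linear relationship with price is unlikely to be chance; the noise column should score near zero.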
Key Insights from ANOVA:
- Most Important Predictors: Converted_datatype_for_price($), Size_sqft, Converted_datatype_for_size_sqft, House_condition_New, House_condition_Old, Has_pool, Year_built
- Moderately Important: Furnishing_Semi-Furnished, Lot_size
- Not Significant: Bath_rooms, Garage_available, Location_Urban, and others
Visualization
I visualized the F-values and p-values using a heatmap to quickly identify the most influential features:
```python
import matplotlib.pyplot as plt
import seaborn as sns

# anova_results holds one row per feature with its F-value and p-value
heatmap_data = anova_results.set_index('Feature')[['F_value', 'p_value']]

plt.figure(figsize=(10, 6))
sns.heatmap(heatmap_data, annot=True, cmap='YlGnBu', fmt=".2e")
plt.title('ANOVA F-value and p-value Heatmap for Features')
plt.show()
```
Skills and Experience Gained
- Data Cleaning: Learned to handle missing values, standardize data, and ensure consistency.
- Feature Engineering: Gained experience in transforming and encoding features for machine learning.
- Model Evaluation: Used multiple regression metrics to assess model performance.
- Statistical Analysis: Applied ANOVA to interpret feature importance and guide model refinement.
- Visualization: Created clear plots to communicate results and insights.
- Critical Thinking: Recognized the importance of checking for data leakage and overfitting.
Conclusion
This project was a comprehensive exercise in the end-to-end machine learning workflow, from raw data to actionable insights. The experience reinforced the importance of data preparation, careful model evaluation, and statistical interpretation in building robust predictive models.
**Thanks for reading! If you have questions or want to discuss more about data science and machine learning, feel free to reach out in the comments.**