Introduction
Predicting house prices is a common problem in data science. A house’s price is influenced by many factors such as its size, number of bedrooms, distance to the city, and nearby amenities. However, real-world datasets also contain noise — features that do not truly affect the price.
In this article, we use a house price dataset to explain:
- Ordinary Least Squares (OLS)
- Ridge Regression (L2 Regularization)
- Lasso Regression (L1 Regularization)
The goal is to understand how these models work, why regularization is needed, and which model to choose in practice, using simple explanations and visual support.
1. Loading the Dataset
We begin by loading the dataset into Python using common data science libraries. This step allows us to inspect the data and understand what features are available.
The dataset contains 500 houses and includes:
- House characteristics (size, bedrooms, bathrooms)
- Location information (distance to city)
- Social features (schools nearby)
- Some noisy or weak features that do not meaningfully affect the price
Figure 1: Loading the house price dataset and displaying sample rows.
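In code, this loading step might look something like the sketch below. The filename house_prices.csv is an assumption for illustration; the actual file and column names may differ.

```python
import pandas as pd

# Load the house price dataset (filename is an assumption for illustration)
df = pd.read_csv("house_prices.csv")

# Inspect the shape and a few sample rows
print(df.shape)   # expected: (500, number_of_columns)
print(df.head())
```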

2. Understanding the Data
Before building any model, it is important to understand the data.
Important Features
- size_sqm – size of the house
- bedrooms – number of bedrooms
- bathrooms – number of bathrooms
- distance_to_city – distance from city center
- schools_nearby – number of nearby schools
Noisy / Weak Features
- paint_color_code
- random_id_feature
- weather_noise
- street_code
Target Variable
- price – the value we want to predict
Some features clearly make sense, while others are random and should not affect house price.
3. Splitting Features and Target
Next, we separate:
- Input features (X) – all columns except price
- Target (y) – the house price
This prepares the data for training machine learning models.
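Assuming the DataFrame df from the loading step and a price column as described above, the separation might look like this:

```python
# Input features: every column except the target
X = df.drop(columns=["price"])

# Target: the house price we want to predict
y = df["price"]
```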
4. Train–Test Split
To evaluate model performance properly, we split the dataset into:
- Training data – used to train the model
- Testing data – used to evaluate performance on unseen data
This step is critical for detecting overfitting.
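A minimal sketch of this split with scikit-learn, continuing from the X and y defined above (the 80/20 ratio and random seed are typical choices, not values from the original article):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the houses as unseen test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```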
5. Ordinary Least Squares (OLS)
What is OLS?
Ordinary Least Squares is the most basic linear regression method. It finds model coefficients by minimizing the sum of squared differences between actual house prices and predicted prices.
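With scikit-learn, a plain OLS fit on the training split might look like the sketch below (it assumes the X_train and y_train variables from the previous step):

```python
from sklearn.linear_model import LinearRegression

# Fit ordinary least squares on the training data
ols = LinearRegression()
ols.fit(X_train, y_train)

# Inspect the learned coefficients, including those on noisy features
for name, coef in zip(X_train.columns, ols.coef_):
    print(f"{name}: {coef:.3f}")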
Why OLS Can Overfit
OLS:
- Uses all features
- Assigns coefficients to both useful and noisy variables
- Can give large weights to irrelevant features
In our dataset, OLS may treat random_id_feature as important, even though it has no real meaning.
6. Detecting Overfitting with OLS
To check overfitting, we compare:
- Training performance
- Testing performance
If training accuracy is high but test accuracy is much lower, the model is overfitting.
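One simple way to make this comparison is to look at the R² score on both splits, as in the sketch below (it assumes the ols model and the train/test variables from earlier steps):

```python
from sklearn.metrics import r2_score

# R^2 on training data vs. unseen test data
train_r2 = r2_score(y_train, ols.predict(X_train))
test_r2 = r2_score(y_test, ols.predict(X_test))

print(f"Train R^2: {train_r2:.3f}")
print(f"Test  R^2: {test_r2:.3f}")
# A large gap between the two scores suggests overfitting
```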
7. Regularization: Why We Need It
Regularization adds a penalty to the loss function to control model complexity.
It helps:
- Reduce overfitting
- Shrink large coefficients
- Improve performance on unseen data
Two common regularization techniques are Ridge and Lasso regression.
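In equation form, both methods add a penalty term to the ordinary least-squares loss, with a parameter α controlling how strongly coefficients are penalized:

```latex
\text{Ridge:}\quad \min_{\beta}\; \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \;+\; \alpha \sum_{j=1}^{p} \beta_j^2
\qquad
\text{Lasso:}\quad \min_{\beta}\; \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \;+\; \alpha \sum_{j=1}^{p} |\beta_j|
```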
8. Ridge Regression (L2 Regularization)
How Ridge Works
Ridge regression adds an L2 penalty, which penalizes large coefficients by squaring them.
Ridge:
- Shrinks all coefficients
- Keeps all features
- Reduces sensitivity to noise
Ridge on Our Dataset
In our house price data:
- Important features still have strong influence
- Noisy features receive very small coefficients
- No feature is completely removed
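A hedged sketch of fitting Ridge on this data is shown below. The value alpha=1.0 is just a starting point, not the article's tuned value; in practice it would be chosen with cross-validation.

```python
from sklearn.linear_model import Ridge

# L2-penalized regression; alpha controls how strongly coefficients are shrunk
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Every feature keeps a (possibly tiny) non-zero coefficient
for name, coef in zip(X_train.columns, ridge.coef_):
    print(f"{name}: {coef:.3f}")
```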
9. Lasso Regression (L1 Regularization)
How Lasso Works
Lasso regression uses an L1 penalty, which can shrink coefficients to zero.
This means:
- Unimportant features are removed
- The model becomes simpler and easier to interpret
Lasso on Our Dataset
After applying Lasso:
- Features like size_sqm and bedrooms remain
- Noisy features such as weather_noise and random_id_feature are set to zero
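A corresponding Lasso sketch is below; again, alpha=0.1 is an illustrative value rather than a tuned one, and the variables come from the earlier steps:

```python
from sklearn.linear_model import Lasso

# L1-penalized regression; sufficiently weak features are driven exactly to zero
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# List the features Lasso kept (non-zero coefficients)
kept = [name for name, coef in zip(X_train.columns, lasso.coef_) if coef != 0]
print("Features kept by Lasso:", kept)
```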
10. Ridge vs Lasso Comparison
| Aspect | Ridge Regression | Lasso Regression |
|---|---|---|
| Regularization | L2 | L1 |
| Feature selection | No | Yes |
| Handles noise | Shrinks | Removes |
| Interpretability | Lower | Higher |
11. Model Evaluation Using Residuals
Residuals are the differences between:
- Actual house prices
- Predicted house prices
By plotting residuals:
- Random scatter → good model
- Clear patterns → poor model fit
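A simple residual plot might be produced as follows (using the ridge model from above purely as an example; any of the fitted models could be checked the same way):

```python
import matplotlib.pyplot as plt

# Residuals = actual prices minus predicted prices on the test set
predictions = ridge.predict(X_test)
residuals = y_test - predictions

plt.scatter(predictions, residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted price")
plt.ylabel("Residual")
plt.title("Residuals vs. predicted prices")
plt.show()
```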
12. Choosing the Right Model
If all features are believed to matter, choose Ridge Regression:
- Keeps all features
- Controls overfitting
- Works well with correlated variables
If only a few features are important, choose Lasso Regression:
- Removes noisy features
- Produces a simpler model
- Easier to explain and interpret
Conclusion
Using the house price dataset, we observe that:
- OLS is simple but prone to overfitting
- Ridge regression improves stability by shrinking coefficients
- Lasso regression simplifies the model by removing irrelevant features