Introduction
Linear regression stands as one of the most fundamental tools in a data scientist's toolkit. At its core lies Ordinary Least Squares (OLS), a method that estimates model parameters by minimizing the sum of squared differences between predicted and actual values. In many real-world problems, such as house price prediction, datasets often contain many features, correlated variables, and noisy inputs. In such cases, OLS becomes unstable and prone to overfitting. To address these challenges, regularization techniques are used. The two most important regularization-based models are:
- Ridge Regression (L2 Regularization)
- Lasso Regression (L1 Regularization)
Ordinary Least Squares (OLS)
Ordinary Least Squares estimates model parameters by minimizing the sum of squared residuals between predicted and actual values:
RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
where ŷᵢ is the predicted price and yᵢ the actual price for the i-th observation.
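For concreteness, the residual sum of squares can be computed directly with NumPy (a minimal sketch; the price values are made up for illustration):
import numpy as np

# Actual and predicted house prices (illustrative values)
y_actual = np.array([250_000, 310_000, 180_000, 420_000])
y_pred = np.array([240_000, 325_000, 175_000, 400_000])

# Residual sum of squares: sum of squared differences
rss = np.sum((y_actual - y_pred) ** 2)
print(rss)  # 750000000 = 10000^2 + 15000^2 + 5000^2 + 20000^2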
OLS works well for small, clean datasets, but struggles when:
- There are many features
- Features are highly correlated (multicollinearity)
- Data contains noise
This leads to overfitting, where the model performs well on training data but poorly on unseen data.
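The multicollinearity problem can be seen directly by fitting OLS on bootstrap resamples of a synthetic dataset with strongly correlated features (a sketch on made-up data; the feature names simply mirror the house price example):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50
size = rng.normal(1500, 300, n)
rooms = size / 500 + rng.normal(0, 0.3, n)      # strongly correlated with size
price = 200 * size + 5000 * rooms + rng.normal(0, 20000, n)

X = np.column_stack([size, rooms])

# Fit OLS on two bootstrap resamples and compare the coefficients
for seed in (1, 2):
    idx = np.random.default_rng(seed).integers(0, n, n)
    coef = LinearRegression().fit(X[idx], price[idx]).coef_
    print(coef)  # the coefficient on the correlated 'rooms' feature can swing widely between resamples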
Regularization in Linear Regression
Regularization addresses overfitting by including a penalty term in the loss function, effectively charging the model for complexity. The model now has to weigh accuracy against simplicity rather than just minimizing error. This penalty discourages large coefficients, resulting in models that generalize better to new data.
General form: Loss = Error + Penalty
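Expressed in code, this general form might look like the following sketch (penalized_loss is an illustrative helper, not a library function):
import numpy as np

def penalized_loss(y, y_pred, coefs, lam, penalty="l2"):
    """Error term plus a penalty on coefficient size (illustrative helper)."""
    coefs = np.asarray(coefs)
    error = np.sum((y - y_pred) ** 2)                # RSS
    if penalty == "l2":
        return error + lam * np.sum(coefs ** 2)      # Ridge-style penalty
    return error + lam * np.sum(np.abs(coefs))       # Lasso-style penalty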
Ridge Regression (L2 Regularization)
Ridge regression modifies the OLS loss function by adding an L2 penalty term proportional to the sum of squared coefficients.
Ridge Regression Loss Function:
Minimize: RSS + λΣβⱼ² = Σ(yᵢ - ŷᵢ)² + λ(β₁² + β₂² + ... + βₚ²)
Where:
- λ (lambda) = regularization parameter (λ ≥ 0)
- The penalty term is the sum of squared coefficients
- Note: The intercept β₀ is typically not penalized.
Conceptual Effect
- Shrinks coefficients smoothly
- Reduces model variance
- Keeps all features
- Handles multicollinearity well
Key Property
Ridge regression does not perform feature selection because coefficients are reduced but never become exactly zero.
Python Example:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
y_pred_ridge = ridge.predict(X_test_scaled)
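To see the shrinkage behavior directly, Ridge can be refit for increasing alpha values on a small synthetic dataset (a minimal sketch; the data is generated only for illustration):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data just for illustration
X, y = make_regression(n_samples=100, n_features=5, n_informative=5, noise=10, random_state=0)

for alpha in (0.1, 1.0, 10.0, 100.0):
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coefs, 2))  # coefficients shrink as alpha grows but remain nonzero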
Lasso Regression (L1 Regularization)
Lasso takes a different approach through L1 regularization. Its loss function penalizes the sum of absolute coefficient values rather than squared values.
Lasso Regression Loss Function:
Minimize: RSS + λΣ|βⱼ| = Σ(yᵢ - ŷᵢ)² + λ(|β₁| + |β₂| + ... + |βₚ|)
Where:
The penalty term is the sum of absolute values of coefficients
λ controls the strength of regularization.
Conceptual Effect
- Creates sparse models
- Forces some coefficients to exactly zero
- Automatically removes weak features
Key Property
Lasso performs feature selection, producing simpler and more interpretable models.
Python Example:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)
y_pred_lasso = lasso.predict(X_test_scaled)
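The sparsity is easy to verify by counting how many coefficients Lasso drives exactly to zero (a sketch on synthetic data in which only three of ten features carry signal):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 10 features, but only 3 are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
print(np.round(lasso.coef_, 2))
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))  # usually close to the 3 informative features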
Comparing Ridge and Lasso
1. Feature Selection Capability
Ridge retains all features with shrunken coefficients, while Lasso performs automatic selection by zeroing out irrelevant features.
2. Coefficient Behavior with Correlated Features
When size (sq ft) and number of rooms correlate at r = 0.85:
- Ridge: Size = $120/sq ft, Rooms = $8,000/room (both moderate)
- Lasso: Size = $180/sq ft, Rooms = $0 (picks one, drops the other)
Ridge distributes weight smoothly; Lasso makes discrete choices.
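The dollar figures above are illustrative, but the qualitative pattern can be reproduced on synthetic data with two correlated predictors (a sketch with made-up values):
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
size = rng.normal(1500, 300, n)                    # sq ft
rooms = 0.004 * size + rng.normal(0, 0.75, n)      # correlated with size (r ≈ 0.85)
price = 150 * size + rng.normal(0, 5000, n)        # price truly driven by size only

X = StandardScaler().fit_transform(np.column_stack([size, rooms]))

print("Ridge:", Ridge(alpha=100).fit(X, price).coef_)   # weight shared across both features
print("Lasso:", Lasso(alpha=5000).fit(X, price).coef_)  # the redundant feature is typically zeroed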
3. Model Interpretability
Ridge model: "Price depends on all 10 factors with varying importance."
Lasso model: "Price primarily depends on size, location, and age; other factors don't matter."
Lasso produces simpler, more explainable models for stakeholders.
Application Scenario: House Price Prediction
Suppose your dataset includes:
- House size
- Number of bedrooms
- Distance to the city
- Number of nearby schools
- Several noisy or weak features
When to use Ridge
Choose Ridge if:
- Most features likely influence price
- Multicollinearity exists
- You want stable predictions
When to use Lasso
Choose Lasso if:
- Only a few features truly matter
- Many variables add noise
- Interpretability is important
Python Implementation
Data Preparation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error

# Select features and target
X = df[['size', 'bedrooms', 'distance_city', 'schools_nearby', 'noise_feature']]
y = df['price']

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features: Ridge and Lasso penalties are sensitive to feature scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
OLS Model
ols = LinearRegression()
ols.fit(X_train_scaled, y_train)
y_pred_ols = ols.predict(X_test_scaled)
print("OLS MSE:", mean_squared_error(y_test, y_pred_ols))
Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
y_pred_ridge = ridge.predict(X_test_scaled)
print("Ridge MSE:", mean_squared_error(y_test, y_pred_ridge))
Lasso Regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)
y_pred_lasso = lasso.predict(X_test_scaled)
print("Lasso MSE:", mean_squared_error(y_test, y_pred_lasso))
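In practice the regularization strength (alpha in scikit-learn) is usually tuned by cross-validation rather than fixed by hand. A minimal sketch using RidgeCV and LassoCV, reusing the scaled arrays from above (the alpha grids are illustrative):
from sklearn.linear_model import RidgeCV, LassoCV

# Cross-validated search over candidate alphas
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]).fit(X_train_scaled, y_train)
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5).fit(X_train_scaled, y_train)

print("Best Ridge alpha:", ridge_cv.alpha_)
print("Best Lasso alpha:", lasso_cv.alpha_)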
Choosing the Right Model for House Prices
If all features contribute meaningfully (e.g., size, bedrooms, schools, distance):
Ridge Regression is preferred.
If only a few features are truly important and others add noise:
Lasso Regression is more suitable due to its feature selection capability.
Model Evaluation and Overfitting Detection
Overfitting can be detected by comparing training and testing performance:
- A high training score but a low test score indicates overfitting
- Similar training and test scores suggest good generalization

Residual analysis also plays a key role. Residuals should be randomly distributed; visible patterns may indicate missing variables or non-linear relationships.
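A quick way to check this in practice is to compare training and test R² for the models fitted above (a minimal sketch, assuming the ols, ridge, and lasso objects from the earlier code):
# Compare training and test R² for the models fitted above
for name, model in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso)]:
    train_r2 = model.score(X_train_scaled, y_train)
    test_r2 = model.score(X_test_scaled, y_test)
    print(f"{name}: train R² = {train_r2:.3f}, test R² = {test_r2:.3f}")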
Conclusion
OLS is simple but prone to overfitting in complex datasets. Ridge and Lasso regression introduce regularization to improve stability and generalization. Ridge is best when all features matter, while Lasso is preferred for sparse, interpretable models. Understanding when and how to apply these techniques is essential for both exams and real-world machine learning problems.