Introduction
Linear regression stands as one of the most fundamental tools in a data scientist's toolkit. At its core lies Ordinary Least Squares (OLS), a method that estimates model parameters by minimizing the sum of squared differences between predicted and actual values. In many real-world problems, such as house price prediction, datasets often contain many features, correlated variables, and noisy inputs. In such cases, OLS becomes unstable and prone to overfitting. To address these challenges, regularization techniques are used. The two most important regularization-based models are:
- Ridge Regression (L2 Regularization)
- Lasso Regression (L1 Regularization)
Ordinary Least Squares (OLS)
Ordinary Least Squares estimates model parameters by minimizing the sum of squared residuals between predicted and actual values:
RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
where ŷᵢ is the predicted price and yᵢ the actual price for the i-th observation.
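For concreteness, the residual sum of squares can be computed directly with NumPy (a minimal sketch; the price values are made up for illustration):
import numpy as np

# Actual and predicted house prices (illustrative values)
y_actual = np.array([250_000, 310_000, 180_000, 420_000])
y_pred = np.array([240_000, 325_000, 175_000, 400_000])

# Residual sum of squares: sum of squared differences
rss = np.sum((y_actual - y_pred) ** 2)
print(rss)  # 750000000 = 10000^2 + 15000^2 + 5000^2 + 20000^2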
OLS works well for small, clean datasets, but struggles when:
- There are many features
- Features are highly correlated (multicollinearity)
- Data contains noise
This leads to overfitting, where the model performs well on training data but poorly on unseen data.
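The multicollinearity problem can be seen directly by fitting OLS on bootstrap resamples of a synthetic dataset with strongly correlated features (a sketch on made-up data; the feature names simply mirror the house price example):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50
size = rng.normal(1500, 300, n)
rooms = size / 500 + rng.normal(0, 0.3, n)      # strongly correlated with size
price = 200 * size + 5000 * rooms + rng.normal(0, 20000, n)

X = np.column_stack([size, rooms])

# Fit OLS on two bootstrap resamples and compare the coefficients
for seed in (1, 2):
    idx = np.random.default_rng(seed).integers(0, n, n)
    coef = LinearRegression().fit(X[idx], price[idx]).coef_
    print(coef)  # the coefficient on the correlated 'rooms' feature can swing widely between resamples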
Regularization in Linear Regression
Regularization addresses overfitting by including a penalty term in the loss function, effectively charging the model for complexity. The model now has to weigh accuracy against simplicity rather than just minimizing error. This penalty discourages large coefficients, resulting in models that generalize better to new data.
General form: Loss = Error + Penalty
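Expressed in code, this general form might look like the following sketch (penalized_loss is an illustrative helper, not a library function):
import numpy as np

def penalized_loss(y, y_pred, coefs, lam, penalty="l2"):
    """Error term plus a penalty on coefficient size (illustrative helper)."""
    coefs = np.asarray(coefs)
    error = np.sum((y - y_pred) ** 2)                # RSS
    if penalty == "l2":
        return error + lam * np.sum(coefs ** 2)      # Ridge-style penalty
    return error + lam * np.sum(np.abs(coefs))       # Lasso-style penalty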
Ridge Regression (L2 Regularization)
Ridge regression modifies the OLS loss function by adding an L2 penalty term proportional to the sum of squared coefficients.
Ridge Regression Loss Function:
Minimize: RSS + λΣβⱼ² = Σ(yᵢ - ŷᵢ)² + λ(β₁² + β₂² + ... + βₚ²)
Where:
- λ (lambda) = regularization parameter (λ ≥ 0)
- The penalty term is the sum of squared coefficients
- Note: The intercept β₀ is typically not penalized.
Conceptual Effect
- Shrinks coefficients smoothly
- Reduces model variance
- Keeps all features
- Handles multicollinearity well
Key Property
Ridge regression does not perform feature selection because coefficients are reduced but never become exactly zero.
Python Example:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
y_pred_ridge = ridge.predict(X_test_scaled)
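To see the shrinkage behavior directly, Ridge can be refit for increasing alpha values on a small synthetic dataset (a minimal sketch; the data is generated only for illustration):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data just for illustration
X, y = make_regression(n_samples=100, n_features=5, n_informative=5, noise=10, random_state=0)

for alpha in (0.1, 1.0, 10.0, 100.0):
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coefs, 2))  # coefficients shrink as alpha grows but remain nonzero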
Lasso Regression (L1 Regularization)
Lasso takes a different approach through L1 regularization. Its loss function penalizes the sum of absolute coefficient values rather than squared values.
Lasso Regression Loss Function:
Minimize: RSS + λΣ|βⱼ| = Σ(yᵢ - ŷᵢ)² + λ(|β₁| + |β₂| + ... + |βₚ|)
Where:
The penalty term is the sum of absolute values of coefficients
λ controls the strength of regularization.
Conceptual Effect
- Creates sparse models
- Forces some coefficients to exactly zero
- Automatically removes weak features
Key Property
Lasso performs feature selection, producing simpler and more interpretable models.
Python Example:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)
y_pred_lasso = lasso.predict(X_test_scaled)
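The sparsity is easy to verify by counting how many coefficients Lasso drives exactly to zero (a sketch on synthetic data in which only three of ten features carry signal):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 10 features, but only 3 are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
print(np.round(lasso.coef_, 2))
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))  # usually close to the 3 informative features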
Comparing Ridge and Lasso
1. Feature Selection Capability
Ridge retains all features with shrunken coefficients, while Lasso performs automatic selection by zeroing out irrelevant features.
2. Coefficient Behavior with Correlated Features
When size (sq ft) and number of rooms correlate at r = 0.85:
- Ridge: Size = $120/sq ft, Rooms = $8,000/room (both moderate)
- Lasso: Size = $180/sq ft, Rooms = $0 (picks one, drops the other)
Ridge distributes weight smoothly; Lasso makes discrete choices.
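The dollar figures above are illustrative, but the qualitative pattern can be reproduced on synthetic data with two correlated predictors (a sketch with made-up values):
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
size = rng.normal(1500, 300, n)                    # sq ft
rooms = 0.004 * size + rng.normal(0, 0.75, n)      # correlated with size (r ≈ 0.85)
price = 150 * size + rng.normal(0, 5000, n)        # price truly driven by size only

X = StandardScaler().fit_transform(np.column_stack([size, rooms]))

print("Ridge:", Ridge(alpha=100).fit(X, price).coef_)   # weight shared across both features
print("Lasso:", Lasso(alpha=5000).fit(X, price).coef_)  # the redundant feature is typically zeroed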
3. Model Interpretability
Ridge model: "Price depends on all 10 factors with varying importance."
Lasso model: "Price primarily depends on size, location, and age; other factors don't matter."
Lasso produces simpler, more explainable models for stakeholders.
Application Scenario: House Price Prediction
Suppose your dataset includes:
- House size
- Number of bedrooms
- Distance to the city
- Number of nearby schools
- Several noisy or weak features
When to use Ridge
Choose Ridge if:
- Most features likely influence price
- Multicollinearity exists
- You want stable predictions
When to use Lasso
Choose Lasso if:
- Only a few features truly matter
- Many variables add noise
- Interpretability is important
Python Implementation
Data Preparation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error

# Select features and target
X = df[['size', 'bedrooms', 'distance_city', 'schools_nearby', 'noise_feature']]
y = df['price']

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features: Ridge and Lasso penalties are sensitive to feature scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
OLS Model
ols = LinearRegression()
ols.fit(X_train_scaled, y_train)
y_pred_ols = ols.predict(X_test_scaled)
print("OLS MSE:", mean_squared_error(y_test, y_pred_ols))
Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
y_pred_ridge = ridge.predict(X_test_scaled)
print("Ridge MSE:", mean_squared_error(y_test, y_pred_ridge))
Lasso Regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)
y_pred_lasso = lasso.predict(X_test_scaled)
print("Lasso MSE:", mean_squared_error(y_test, y_pred_lasso))
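In practice the regularization strength (alpha in scikit-learn) is usually tuned by cross-validation rather than fixed by hand. A minimal sketch using RidgeCV and LassoCV, reusing the scaled arrays from above (the alpha grids are illustrative):
from sklearn.linear_model import RidgeCV, LassoCV

# Cross-validated search over candidate alphas
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]).fit(X_train_scaled, y_train)
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5).fit(X_train_scaled, y_train)

print("Best Ridge alpha:", ridge_cv.alpha_)
print("Best Lasso alpha:", lasso_cv.alpha_)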
Choosing the Right Model for House Prices
If all features contribute meaningfully (e.g., size, bedrooms, schools, distance):
Ridge Regression is preferred.
If only a few features are truly important and others add noise:
Lasso Regression is more suitable due to its feature selection capability.
Model Evaluation and Overfitting Detection
Overfitting can be detected by comparing training and testing performance:
- A high training score but a low test score indicates overfitting
- Similar training and test scores suggest good generalization

Residual analysis also plays a key role. Residuals should be randomly distributed; visible patterns may indicate missing variables or non-linear relationships.
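A quick way to check this in practice is to compare training and test R² for the models fitted above (a minimal sketch, assuming the ols, ridge, and lasso objects from the earlier code):
# Compare training and test R² for the models fitted above
for name, model in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso)]:
    train_r2 = model.score(X_train_scaled, y_train)
    test_r2 = model.score(X_test_scaled, y_test)
    print(f"{name}: train R² = {train_r2:.3f}, test R² = {test_r2:.3f}")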
Conclusion
OLS is simple but prone to overfitting in complex datasets. Ridge and Lasso regression introduce regularization to improve stability and generalization. Ridge is best when all features matter, while Lasso is preferred for sparse, interpretable models. Understanding when and how to apply these techniques is essential for both exams and real-world machine learning problems.