Maureen Muthoni

Ridge Regression vs Lasso Regression

Introduction

Linear regression stands as one of the most fundamental tools in a data scientist's toolkit. At its core lies Ordinary Least Squares (OLS), a method that estimates model parameters by minimizing the sum of squared differences between predicted and actual values. In many real-world problems, such as house price prediction, datasets often contain many features, correlated variables, and noisy inputs. In such cases, traditional OLS regression becomes unstable and prone to overfitting. To address these challenges, regularization techniques are used. The two most important regularization-based models are:

  • Ridge Regression (L2 Regularization)
  • Lasso Regression (L1 Regularization)

Ordinary Least Squares (OLS)

Ordinary Least Squares estimates model parameters by minimizing the sum of squared residuals between predicted and actual values:

RSS = Σᵢ₌₁ⁿ (yᵢ - ŷᵢ)²

where yᵢ is the actual price and ŷᵢ the predicted price.
OLS works well for small, clean datasets, but struggles when:

  • There are many features
  • Features are highly correlated (multicollinearity)
  • Data contains noise

This leads to overfitting, where the model performs well on training data but poorly on unseen data.
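
To see this instability concretely, here is a minimal sketch (the synthetic data is invented purely for illustration) in which two nearly duplicated features make the OLS coefficients blow up even though the underlying relationship is simple:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Two almost identical features: a textbook case of multicollinearity
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)    # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=200)  # the true signal uses only x1

ols = LinearRegression().fit(X, y)
print(ols.coef_)  # typically two huge coefficients with opposite signs instead of ~[3, 0]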

Regularization in Linear Regression

Regularization addresses overfitting by including a penalty term in the loss function, effectively charging the model for complexity. The model now has to weigh accuracy against simplicity rather than just minimizing error. This penalty discourages large coefficients, resulting in models that perform better when applied to new data.

General form: Loss = Error + Penalty
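
To make the trade-off concrete, here is a minimal sketch on synthetic data (the five true coefficients are assumptions chosen for the demo) using Ridge, introduced below, as the penalized model. As the penalty strength grows, the overall size of the coefficients shrinks:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)

# A larger penalty pulls the coefficients closer to zero
for lam in [0.01, 1, 10, 100]:
    coefs = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:>6}: ||beta|| = {np.linalg.norm(coefs):.2f}")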

Ridge Regression (L2 Regularization)

Ridge regression modifies the OLS loss function by adding an L2 penalty term proportional to the sum of squared coefficients.

Ridge Regression Loss Function:
Minimize: RSS + λΣβⱼ² = Σ(yᵢ - ŷᵢ)² + λ(β₁² + β₂² + ... + βₚ²)

Where:

  • λ (lambda) = regularization parameter (λ ≥ 0)
  • The penalty term is the sum of squared coefficients
  • Note: The intercept β₀ is typically not penalized.

Conceptual Effect

  • Shrinks coefficients smoothly
  • Reduces model variance
  • Keeps all features
  • Handles multicollinearity well

Key Property

Ridge regression does not perform feature selection because coefficients are reduced but never become exactly zero.
Python Example:

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

y_pred_ridge = ridge.predict(X_test_scaled)
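
Because the L2 penalty only shrinks coefficients, you can verify the key property directly. Continuing from the fitted ridge model above:

import numpy as np

# All coefficients are shrunk toward zero, but none of them are exactly zero
print(ridge.coef_)
print("Coefficients equal to zero:", np.sum(ridge.coef_ == 0))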

Lasso Regression (L1 Regularization)

Lasso takes a different approach through L1 regularization. Its loss function penalizes the sum of absolute coefficient values rather than squared values.

Lasso Regression Loss Function:
Minimize: RSS + λΣ|βⱼ| = Σ(yᵢ - ŷᵢ)² + λ(|β₁| + |β₂| + ... + |βₚ|)

Where:

  • The penalty term is the sum of the absolute values of the coefficients
  • λ controls the strength of regularization

Conceptual Effect

  • Creates sparse models
  • Forces some coefficients to exactly zero
  • Automatically removes weak features

Key Property
Lasso performs feature selection, producing simpler and more interpretable models.
Python Example:

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)

y_pred_lasso = lasso.predict(X_test_scaled)
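
Depending on alpha, some coefficients land exactly at zero, which is how Lasso drops features. Continuing from the fitted lasso model above:

import numpy as np

# Coefficients driven to exactly zero correspond to features Lasso has dropped
print(lasso.coef_)
kept = np.sum(lasso.coef_ != 0)
print(f"Features kept: {kept} of {len(lasso.coef_)}")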

Comparing Ridge and Lasso

1. Feature Selection Capability
Ridge retains all features with shrunken coefficients, while Lasso performs automatic selection by zeroing out irrelevant features.
2. Coefficient Behavior with Correlated Features
When size (sq ft) and number of rooms correlate at r = 0.85:

Ridge: Size = $120/sq ft, Rooms = $8,000/room (both moderate)
Lasso: Size = $180/sq ft, Rooms = $0 (picks one, drops the other)

Ridge distributes weight smoothly; Lasso makes discrete choices (see the sketch after this comparison).
3. Model Interpretability
Ridge model: "Price depends on all 10 factors with varying importance."
Lasso model: "Price primarily depends on size, location, and age; other factors don't matter."
Lasso produces simpler, more explainable models for stakeholders.
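
The dollar figures above are illustrative, but the behavior is easy to reproduce. The sketch below uses synthetic data (the correlation, true coefficients, and alpha values are assumptions chosen for the demo): Ridge spreads the weight across both correlated features, while Lasso tends to keep one and zero out the other.

import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two strongly correlated predictors (think size and number of rooms, r ≈ 0.9)
size = rng.normal(size=500)
rooms = 0.9 * size + np.sqrt(1 - 0.9**2) * rng.normal(size=500)
X = StandardScaler().fit_transform(np.column_stack([size, rooms]))
y = 100 * X[:, 0] + rng.normal(scale=10, size=500)  # price driven mainly by size

print("Ridge:", Ridge(alpha=100).fit(X, y).coef_)  # weight spread across both features
print("Lasso:", Lasso(alpha=5).fit(X, y).coef_)    # typically keeps one, zeroes the other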

Application Scenario: House Price Prediction

Suppose your dataset includes:

  • House size
  • Number of bedrooms
  • Distance to the city
  • Number of nearby schools
  • Several noisy or weak features

When to use Ridge
Choose Ridge if:

  • Most features likely influence price
  • Multicollinearity exists
  • You want stable predictions

When to use Lasso
Choose Lasso if:

  • Only a few features truly matter
  • Many variables add noise
  • Interpretability is important

Python Implementation
Data Preparation

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error


# df is assumed to already hold the housing dataset as a pandas DataFrame
X = df[['size', 'bedrooms', 'distance_city', 'schools_nearby', 'noise_feature']]
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


OLS Model

ols = LinearRegression()
ols.fit(X_train_scaled, y_train)

y_pred_ols = ols.predict(X_test_scaled)
mean_squared_error(y_test, y_pred_ols)

Ridge Regression

ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

y_pred_ridge = ridge.predict(X_test_scaled)
mean_squared_error(y_test, y_pred_ridge)

Lasso Regression

lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)

y_pred_lasso = lasso.predict(X_test_scaled)
mean_squared_error(y_test, y_pred_lasso)

Choosing the Right Model for House Prices
If all features contribute meaningfully (e.g., size, bedrooms, schools, distance):
Ridge Regression is preferred.
If only a few features are truly important and others add noise:
Lasso Regression is more suitable due to its feature selection capability.
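
In practice, the choice between the two (and the strength of λ) is usually settled by cross-validation rather than intuition alone. A minimal sketch, reusing the scaled training data and the mean_squared_error import from the data preparation step above, with alpha grids chosen purely for illustration:

from sklearn.linear_model import RidgeCV, LassoCV

# Search a small grid of penalty strengths with 5-fold cross-validation
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(X_train_scaled, y_train)

lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5, max_iter=10000)
lasso_cv.fit(X_train_scaled, y_train)

print("Best ridge alpha:", ridge_cv.alpha_)
print("Best lasso alpha:", lasso_cv.alpha_)
print("Ridge test MSE:", mean_squared_error(y_test, ridge_cv.predict(X_test_scaled)))
print("Lasso test MSE:", mean_squared_error(y_test, lasso_cv.predict(X_test_scaled)))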

Model Evaluation and Overfitting Detection

Overfitting can be detected by comparing training and testing performance:

  • High training score but low test score indicates overfitting
  • Similar training and test scores suggest good generalization

Residual analysis also plays a key role. Residuals should be randomly distributed; visible patterns may indicate missing variables or non-linear relationships.
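
A quick way to run this check in code, continuing with the models fitted above (R² from .score() is used here as the training and testing score):

import matplotlib.pyplot as plt

# Compare training vs. test R² for each fitted model; a large gap signals overfitting
for name, model in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso)]:
    train_r2 = model.score(X_train_scaled, y_train)
    test_r2 = model.score(X_test_scaled, y_test)
    print(f"{name}: train R² = {train_r2:.3f}, test R² = {test_r2:.3f}")

# Residuals should look like random noise around zero when plotted against predictions
y_pred = ridge.predict(X_test_scaled)
plt.scatter(y_pred, y_test - y_pred)
plt.axhline(0, color="red")
plt.xlabel("Predicted price")
plt.ylabel("Residual")
plt.show()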

Conclusion

OLS is simple but prone to overfitting in complex datasets. Ridge and Lasso regression introduce regularization to improve stability and generalization. Ridge is best when all features matter, while Lasso is preferred for sparse, interpretable models. Understanding when and how to apply these techniques is essential for both exams and real-world machine learning problems.
