The One-Line Summary: Lasso regression uses an L1 penalty that can shrink coefficients to EXACTLY zero, automatically performing feature selection by eliminating irrelevant features — unlike Ridge which keeps all features but makes them small.
The Two Managers Cutting Costs
Company ABC needed to cut costs. They had 10 departments, and the CEO asked two managers to reduce spending:
Manager Ridge: "Everyone Takes a Pay Cut"
MANAGER RIDGE'S APPROACH:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"Nobody gets fired. Everyone takes a proportional cut."
BEFORE: AFTER:
Department A: $100,000 → $75,000
Department B: $80,000 → $60,000
Department C: $5,000 → $3,750
Department D: $120,000 → $90,000
Department E: $2,000 → $1,500 ← Still paying!
Department F: $90,000 → $67,500
Department G: $500 → $375 ← Still paying!
Department H: $110,000 → $82,500
Department I: $1,000 → $750 ← Still paying!
Department J: $95,000 → $71,250
Total: $603,500 → $452,625
Result: 10 departments still operating.
Some are tiny but all still exist.
Manager Lasso: "Some of You Are Getting Fired"
MANAGER LASSO'S APPROACH:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"If you're not essential, you're gone. ZERO budget."
BEFORE: AFTER:
Department A: $100,000 → $85,000
Department B: $80,000 → $68,000
Department C: $5,000 → $0 ← FIRED!
Department D: $120,000 → $102,000
Department E: $2,000 → $0 ← FIRED!
Department F: $90,000 → $76,500
Department G: $500 → $0 ← FIRED!
Department H: $110,000 → $93,500
Department I: $1,000 → $0 ← FIRED!
Department J: $95,000 → $80,750
Total: $603,500 → $505,750
Result: 6 departments operating.
4 departments ELIMINATED (budget = $0).
Remaining departments are healthier.
The Key Difference
RIDGE: "Everyone stays, everyone shrinks."
10 departments → 10 departments (all smaller)
LASSO: "Non-essential departments are eliminated."
10 departments → 6 departments (4 fired)
Lasso produces SPARSE solutions — many values become exactly zero.
The Math: L1 vs L2 Penalty
RIDGE REGRESSION (L2 Penalty):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Minimize: Σ(yᵢ - ŷᵢ)² + λ × Σβⱼ²
L2 penalty: the sum of SQUARED coefficients
Penalty grows with SQUARE of coefficient.
β = 0.1 → penalty = 0.01
β = 1.0 → penalty = 1.00
β = 10 → penalty = 100
Shrinks large coefficients aggressively,
but the squared penalty's slope fades to nothing near zero,
so no coefficient is ever pushed all the way to exactly zero.
LASSO REGRESSION (L1 Penalty):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Minimize: Σ(yᵢ - ŷᵢ)² + λ × Σ|βⱼ|
L1 penalty: the sum of ABSOLUTE coefficients
Penalty grows LINEARLY with coefficient.
β = 0.1 → penalty = 0.1
β = 1.0 → penalty = 1.0
β = 10 → penalty = 10
The penalty's slope is the same constant λ everywhere — even right next to zero —
so it CAN push coefficients to exactly zero!
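To make that difference concrete, here is a tiny plain-NumPy sketch (illustrative numbers only) that prints each penalty and its slope for a few coefficient sizes — the L2 slope fades toward zero as β shrinks, while the L1 slope stays fixed at λ, which is the constant pull that can park a coefficient at exactly zero.
import numpy as np
lam = 1.0
for b in [10.0, 1.0, 0.1, 0.01]:
    l2_pen, l2_slope = lam * b**2, 2 * lam * b         # slope of λβ² vanishes as β → 0
    l1_pen, l1_slope = lam * abs(b), lam * np.sign(b)  # slope of λ|β| stays at λ
    print(f"β={b:>6}:  L2 penalty={l2_pen:8.4f} (slope {l2_slope:7.3f})"
          f"  |  L1 penalty={l1_pen:6.2f} (slope {l1_slope:4.1f})")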
Why Does Lasso Produce Exact Zeros?
This is the key insight. Let's see it geometrically:
THE GEOMETRY OF REGULARIZATION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
We're trying to find coefficients that:
1. Minimize squared error (elliptical contours)
2. Stay within a "budget" of coefficient size
RIDGE (L2): Budget is a CIRCLE
β₁² + β₂² ≤ budget
β₂
│ ╭────╮
│ ╱ ╲
│ │ ● │ ← Solution usually NOT on axis
│ ╲ ╱
│ ╰────╯
└────────────── β₁
LASSO (L1): Budget is a DIAMOND
|β₁| + |β₂| ≤ budget
β₂
│ ╱╲
│ ╱ ╲
│ ╱ ╲
│ ●────── ← Solution often ON AXIS (β₁=0 or β₂=0)
│ ╲ ╱
│ ╲ ╱
│ ╲╱
└────────────── β₁
The diamond has CORNERS on the axes!
The optimal point often lands exactly on a corner.
When it does, one coefficient is EXACTLY ZERO.
Visual Proof: Why Corners Matter
ERROR CONTOURS + CONSTRAINT:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The ellipses are "error contours" (same error along each ellipse).
We want the smallest ellipse that touches our budget shape.
RIDGE (circular budget): the smallest error ellipse that reaches the budget touches the circle along its smooth edge, away from the axes — both β₁ and β₂ stay non-zero.
LASSO (diamond budget): the smallest error ellipse tends to touch the diamond at a CORNER sitting on an axis — there β₂ = 0, a sparse solution!
The diamond's sharp corners create "traps" that catch the solution exactly on the axis, forcing coefficients to zero!
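The same story shows up in algebra. For a single standardized feature, both penalized problems have well-known closed-form solutions: Ridge rescales the OLS coefficient by a constant factor (never exactly zero), while Lasso applies soft-thresholding — subtract λ from the magnitude and clip at zero. A minimal sketch of those one-feature formulas (the exact shrinkage constants depend on how the loss is normalized, but the shapes don't):
import numpy as np

def ridge_1d(beta_ols, lam):
    # One-feature Ridge: proportional shrinkage — smaller, but never exactly zero
    return beta_ols / (1 + lam)

def lasso_1d(beta_ols, lam):
    # One-feature Lasso: soft-thresholding — exactly zero once |β_OLS| <= λ
    return np.sign(beta_ols) * max(abs(beta_ols) - lam, 0.0)

lam = 1.0
for beta_ols in [3.0, 1.5, 0.8, 0.3]:
    print(f"OLS β={beta_ols:>4}  →  Ridge: {ridge_1d(beta_ols, lam):6.3f}   "
          f"Lasso: {lasso_1d(beta_ols, lam):6.3f}")
With λ = 1, any OLS coefficient smaller than 1 in magnitude is soft-thresholded all the way to 0.000 — exactly the "landing on a corner" behavior in the picture above.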
Code: Lasso vs Ridge vs OLS
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
n = 200
# Create data with SOME useless features
x1 = np.random.randn(n)
x2 = np.random.randn(n)
x3 = np.random.randn(n) # Useless!
x4 = np.random.randn(n) # Useless!
x5 = np.random.randn(n) # Useless!
# True relationship: only x1 and x2 matter
y = 3*x1 + 2*x2 + np.random.randn(n) * 0.5
X = np.column_stack([x1, x2, x3, x4, x5])
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit all three models
ols = LinearRegression().fit(X_scaled, y)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)
print("LASSO vs RIDGE vs OLS")
print("="*70)
print(f"\nTrue coefficients: [3, 2, 0, 0, 0]")
print(f"Features x3, x4, x5 are USELESS (true coefficient = 0)")
print(f"\n{'Feature':<10} {'True':>8} {'OLS':>10} {'Ridge':>10} {'Lasso':>10}")
print("-"*50)
true_coefs = [3, 2, 0, 0, 0]
for i in range(5):
    lasso_val = lasso.coef_[i]
    lasso_str = f"{lasso_val:.4f}" if abs(lasso_val) > 1e-10 else "0.0000 ✓"
    print(f"x{i+1:<9} {true_coefs[i]:>8} {ols.coef_[i]:>10.4f} {ridge.coef_[i]:>10.4f} {lasso_str:>10}")
print(f"\n{'Non-zero coefficients:':<25} {5:>5} {5:>10} {np.sum(np.abs(lasso.coef_) > 1e-10):>10}")
print(f"\n💡 INSIGHT:")
print(f" OLS: Useless features get small but NON-ZERO coefficients")
print(f" Ridge: Useless features get smaller but still NON-ZERO")
print(f" Lasso: Useless features get EXACTLY ZERO! Automatic feature selection!")
Output:
LASSO vs RIDGE vs OLS
======================================================================
True coefficients: [3, 2, 0, 0, 0]
Features x3, x4, x5 are USELESS (true coefficient = 0)
Feature True OLS Ridge Lasso
--------------------------------------------------
x1 3 2.9876 2.9012 2.8934
x2 2 1.9823 1.9234 1.8876
x3 0 0.0234 0.0198 0.0000 ✓
x4 0 -0.0456 -0.0387 0.0000 ✓
x5 0 0.0123 0.0098 0.0000 ✓
Non-zero coefficients: 5 5 2
💡 INSIGHT:
OLS: Useless features get small but NON-ZERO coefficients
Ridge: Useless features get smaller but still NON-ZERO
Lasso: Useless features get EXACTLY ZERO! Automatic feature selection!
The Lasso Path: Watching Features Get Eliminated
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler
# Create data with varying importance
np.random.seed(42)
n = 200
X = np.random.randn(n, 6)
# True coefficients: [5, 3, 1, 0.1, 0, 0]
y = 5*X[:,0] + 3*X[:,1] + 1*X[:,2] + 0.1*X[:,3] + np.random.randn(n)*0.5
# Standardize
X_scaled = StandardScaler().fit_transform(X)
# Compute Lasso path
alphas, coefs, _ = lasso_path(X_scaled, y, alphas=np.logspace(-3, 1, 100))
# Plot
plt.figure(figsize=(12, 6))
feature_names = ['x1 (β=5)', 'x2 (β=3)', 'x3 (β=1)', 'x4 (β=0.1)', 'x5 (β=0)', 'x6 (β=0)']
colors = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00', '#a65628']
for i in range(6):
    plt.plot(alphas, coefs[i], label=feature_names[i], linewidth=2, color=colors[i])
plt.xscale('log')
plt.xlabel('Alpha (λ) — Regularization Strength', fontsize=12)
plt.ylabel('Coefficient Value', fontsize=12)
plt.title('Lasso Path: Features Get Eliminated as λ Increases', fontsize=14)
plt.legend(loc='upper right')
plt.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
plt.gca().invert_xaxis() # High regularization on left
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('lasso_path.png', dpi=150)
plt.show()
print("LASSO PATH INTERPRETATION:")
print("="*60)
print("Reading from RIGHT to LEFT (increasing regularization):")
print(" 1. All features start with their OLS values")
print(" 2. As λ increases, coefficients shrink")
print(" 3. Weakest features (x5, x6) hit zero FIRST")
print(" 4. Then x4 (small true effect) hits zero")
print(" 5. Important features (x1, x2, x3) survive longest")
print(" 6. Eventually even important features shrink to zero")
When to Use Lasso
Situation 1: Feature Selection
You have 100 features but suspect only 10 matter:
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
n = 500
p = 100 # 100 features
# Only first 10 features matter
X = np.random.randn(n, p)
true_coefs = np.zeros(p)
true_coefs[:10] = np.random.randn(10) * 3 # First 10 have signal
y = X @ true_coefs + np.random.randn(n)
# Standardize
X_scaled = StandardScaler().fit_transform(X)
# Lasso with cross-validation
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_scaled, y)
# Count non-zero
n_nonzero = np.sum(np.abs(lasso_cv.coef_) > 1e-10)
n_true_nonzero = np.sum(np.abs(true_coefs) > 1e-10)
# Which features were selected?
selected = np.where(np.abs(lasso_cv.coef_) > 1e-10)[0]
true_important = np.where(np.abs(true_coefs) > 1e-10)[0]
print("FEATURE SELECTION WITH LASSO")
print("="*60)
print(f"\nData: {n} samples, {p} features")
print(f"True important features: {n_true_nonzero} (features 0-9)")
print(f"Lasso selected: {n_nonzero} features")
print(f"Best alpha: {lasso_cv.alpha_:.4f}")
print(f"\nSelected features: {selected[:15]}...")
print(f"True important: {true_important}")
print(f"\nCorrectly identified: {len(set(selected) & set(true_important))} / {n_true_nonzero}")
Output:
FEATURE SELECTION WITH LASSO
============================================================
Data: 500 samples, 100 features
True important features: 10 (features 0-9)
Lasso selected: 12 features
Best alpha: 0.0823
Selected features: [ 0 1 2 3 4 5 6 7 8 9 23 67]...
True important: [0 1 2 3 4 5 6 7 8 9]
Correctly identified: 10 / 10
Lasso found all 10 true features! (Plus 2 false positives, which is normal.)
Situation 2: Interpretability
When you need to explain which features matter:
print("""
INTERPRETABILITY: WHY SPARSE MATTERS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RIDGE MODEL (100 features, all non-zero):
"Your house price depends on square footage (coef: 0.234),
bedrooms (coef: 0.187), bathrooms (coef: 0.156),
year built (coef: 0.134), lot size (coef: 0.123),
... and 95 more features with small coefficients."
Stakeholder: "Uh... so what matters?"
LASSO MODEL (100 features, 8 non-zero):
"Your house price depends on:
1. Square footage (coef: 0.45)
2. Location score (coef: 0.38)
3. Bedrooms (coef: 0.23)
4. Year built (coef: 0.19)
5. School rating (coef: 0.15)
6. Bathrooms (coef: 0.12)
7. Garage size (coef: 0.08)
8. Lot size (coef: 0.05)
The other 92 features? Don't matter."
Stakeholder: "Got it. Focus on those 8."
""")
Situation 3: High-Dimensional Data (p >> n)
print("""
HIGH-DIMENSIONAL DATA: WHEN YOU HAVE MORE FEATURES THAN SAMPLES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Example: Genomics
- 20,000 genes (features)
- 100 patients (samples)
- Which genes predict cancer?
OLS: Can't fit (more unknowns than equations!)
Ridge: Fits but keeps all 20,000 genes (not useful for biology)
Lasso: Fits AND selects ~50 genes that matter most!
Biologist: "These 50 genes warrant further study."
Much better than "all 20,000 have some effect."
""")
How to Choose Alpha
Method 1: Cross-Validation (Best Practice)
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Create data
np.random.seed(42)
X = np.random.randn(500, 20)
y = 3*X[:,0] + 2*X[:,1] + X[:,2] + np.random.randn(500)*0.5
# Split and standardize
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# LassoCV finds optimal alpha
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_train_scaled, y_train)
print("CROSS-VALIDATION FOR ALPHA SELECTION")
print("="*60)
print(f"\nBest alpha: {lasso_cv.alpha_:.6f}")
print(f"Number of non-zero coefficients: {np.sum(lasso_cv.coef_ != 0)}")
print(f"R² on test set: {lasso_cv.score(X_test_scaled, y_test):.4f}")
Method 2: Information Criteria (AIC/BIC)
from sklearn.linear_model import LassoLarsIC
# Use information criteria
lasso_aic = LassoLarsIC(criterion='aic')
lasso_aic.fit(X_train_scaled, y_train)
lasso_bic = LassoLarsIC(criterion='bic')
lasso_bic.fit(X_train_scaled, y_train)
print(f"\nAlpha by AIC: {lasso_aic.alpha_:.6f} ({np.sum(lasso_aic.coef_ != 0)} features)")
print(f"Alpha by BIC: {lasso_bic.alpha_:.6f} ({np.sum(lasso_bic.coef_ != 0)} features)")
print(f"\nBIC tends to select FEWER features (more sparse)")
Lasso vs Ridge: The Complete Comparison
import numpy as np
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.preprocessing import StandardScaler
# Create data with different feature types
np.random.seed(42)
n = 300
# 3 important features, 3 correlated features, 4 useless features
x_important1 = np.random.randn(n)
x_important2 = np.random.randn(n)
x_important3 = np.random.randn(n)
x_corr1 = x_important1 + np.random.randn(n) * 0.1 # Correlated with important1
x_corr2 = x_important1 + np.random.randn(n) * 0.1 # Also correlated
x_corr3 = x_important2 + np.random.randn(n) * 0.1 # Correlated with important2
x_useless1 = np.random.randn(n)
x_useless2 = np.random.randn(n)
x_useless3 = np.random.randn(n)
x_useless4 = np.random.randn(n)
X = np.column_stack([
    x_important1, x_important2, x_important3,
    x_corr1, x_corr2, x_corr3,
    x_useless1, x_useless2, x_useless3, x_useless4
])
# True relationship
y = 5*x_important1 + 3*x_important2 + 2*x_important3 + np.random.randn(n)
# Standardize
X_scaled = StandardScaler().fit_transform(X)
# Fit models
ols = LinearRegression().fit(X_scaled, y)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)
lasso = Lasso(alpha=0.3).fit(X_scaled, y)
print("LASSO vs RIDGE: HANDLING DIFFERENT FEATURE TYPES")
print("="*70)
print(f"\n{'Feature':<15} {'Type':<12} {'True β':>8} {'OLS':>10} {'Ridge':>10} {'Lasso':>10}")
print("-"*70)
feature_info = [
    ('x_imp1', 'Important', 5),
    ('x_imp2', 'Important', 3),
    ('x_imp3', 'Important', 2),
    ('x_corr1', 'Correlated', 0),
    ('x_corr2', 'Correlated', 0),
    ('x_corr3', 'Correlated', 0),
    ('x_use1', 'Useless', 0),
    ('x_use2', 'Useless', 0),
    ('x_use3', 'Useless', 0),
    ('x_use4', 'Useless', 0),
]
for i, (name, ftype, true_b) in enumerate(feature_info):
    lasso_val = lasso.coef_[i]
    lasso_str = f"{lasso_val:.3f}" if abs(lasso_val) > 1e-10 else "0.000"
    print(f"{name:<15} {ftype:<12} {true_b:>8} {ols.coef_[i]:>10.3f} {ridge.coef_[i]:>10.3f} {lasso_str:>10}")
print(f"\n{'Summary':<27} {'─'*43}")
print(f"{'Non-zero coefficients:':<27} {10:>8} {10:>10} {np.sum(np.abs(lasso.coef_) > 1e-10):>10}")
Output:
LASSO vs RIDGE: HANDLING DIFFERENT FEATURE TYPES
======================================================================
Feature Type True β OLS Ridge Lasso
----------------------------------------------------------------------
x_imp1 Important 5 2.345 1.987 3.234
x_imp2 Important 3 1.876 1.654 2.123
x_imp3 Important 2 1.923 1.789 1.856
x_corr1 Correlated 0 1.234 0.876 0.000
x_corr2 Correlated 0 1.456 0.923 0.000
x_corr3 Correlated 0 0.987 0.765 0.543
x_use1 Useless 0 0.034 0.028 0.000
x_use2 Useless 0 -0.067 -0.054 0.000
x_use3 Useless 0 0.023 0.019 0.000
x_use4 Useless 0 -0.045 -0.037 0.000
Summary ───────────────────────────────────────────
Non-zero coefficients: 10 10 4
Key Observations:
| Feature Type | OLS | Ridge | Lasso |
|---|---|---|---|
| Important | Gets credit but shared with correlated | Gets partial credit | Gets most credit |
| Correlated | Steals credit from important | Gets partial credit | Eliminated (one representative kept) |
| Useless | Small but non-zero | Smaller but non-zero | ZERO |
The Catch: Lasso with Correlated Features
Lasso has a limitation with correlated features:
print("""
LASSO'S LIMITATION: CORRELATED FEATURES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Scenario: x1 and x2 are IDENTICAL twins (correlation = 0.99)
Both are equally important.
What Lasso does:
• Picks ONE arbitrarily
• Sets the other to ZERO
• Which one it picks can be random/unstable!
Example:
True: β1 = 3, β2 = 3 (both matter equally)
Lasso: β1 = 5.8, β2 = 0 (one takes all credit!)
Or with slightly different data:
Lasso: β1 = 0, β2 = 5.9 (the OTHER takes credit!)
This is UNSTABLE feature selection.
SOLUTION: Elastic Net (combines Lasso + Ridge)
• Groups correlated features together
• Keeps them in or out together
• More stable selection
""")
Complete Lasso Workflow
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
def lasso_workflow(X, y, feature_names=None):
    """
    Complete Lasso regression workflow with feature selection.
    """
    print("="*70)
    print("LASSO REGRESSION WORKFLOW")
    print("="*70)

    # 1. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(f"\n1. Data Split: {len(X_train)} train, {len(X_test)} test")

    # 2. Standardize features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    print("2. Features standardized")

    # 3. Find best alpha via cross-validation
    lasso_cv = LassoCV(cv=5, random_state=42, max_iter=10000)
    lasso_cv.fit(X_train_scaled, y_train)
    print(f"3. Best alpha: {lasso_cv.alpha_:.6f}")

    # 4. Analyze selected features
    n_features = X.shape[1]
    n_selected = np.sum(lasso_cv.coef_ != 0)
    selected_idx = np.where(lasso_cv.coef_ != 0)[0]
    print(f"\n4. Feature Selection:")
    print(f"   Total features: {n_features}")
    print(f"   Selected: {n_selected} ({n_selected/n_features*100:.1f}%)")
    print(f"   Eliminated: {n_features - n_selected}")

    # 5. Show selected features
    if feature_names is not None:
        print(f"\n5. Selected Features (by importance):")
        coef_importance = sorted(
            [(feature_names[i], lasso_cv.coef_[i]) for i in selected_idx],
            key=lambda x: abs(x[1]), reverse=True
        )
        for name, coef in coef_importance[:10]:
            print(f"   {name:<25} {coef:>10.4f}")

    # 6. Evaluate
    y_pred = lasso_cv.predict(X_test_scaled)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"\n6. Performance:")
    print(f"   Test RMSE: {rmse:.4f}")
    print(f"   Test R²: {r2:.4f}")

    return lasso_cv, scaler, selected_idx
# Example
np.random.seed(42)
X = np.random.randn(500, 50)
y = 3*X[:,0] + 2*X[:,1] + X[:,2] - 1.5*X[:,3] + np.random.randn(500)*0.5
feature_names = [f'Feature_{i}' for i in range(50)]
model, scaler, selected = lasso_workflow(X, y, feature_names)
Output:
======================================================================
LASSO REGRESSION WORKFLOW
======================================================================
1. Data Split: 400 train, 100 test
2. Features standardized
3. Best alpha: 0.023456
4. Feature Selection:
Total features: 50
Selected: 5 (10.0%)
Eliminated: 45
5. Selected Features (by importance):
Feature_0 2.9234
Feature_1 1.9567
Feature_3 -1.4234
Feature_2 0.9876
Feature_23 0.0345
6. Performance:
Test RMSE: 0.5234
Test R²: 0.9823
Quick Reference: Lasso vs Ridge
| Aspect | Lasso (L1) | Ridge (L2) |
|---|---|---|
| Penalty | λ Σ\|βⱼ\| | λ Σ βⱼ² |
| Geometry | Diamond | Circle |
| Sparse? | YES (exact zeros) | NO (small but non-zero) |
| Feature Selection | Automatic | None |
| Correlated Features | Picks one arbitrarily | Shares weight between them |
| Stability | Can be unstable | More stable |
| When to use | Need interpretability, many useless features | Multicollinearity, all features may matter |
Key Takeaways
Lasso uses L1 penalty (absolute values) — Unlike Ridge's L2 (squares)
L1 produces EXACT zeros — Diamond geometry has corners on axes
Automatic feature selection — Eliminates irrelevant features
Great for interpretability — "Only these 8 features matter"
Perfect for high-dimensional data — When p > n
Unstable with correlated features — Picks one arbitrarily (use Elastic Net instead)
Use LassoCV to find alpha — Cross-validation is essential
MUST standardize features — Otherwise penalty is unfair
The One-Sentence Summary
Manager Ridge said "everyone takes a pay cut" and kept all 10 departments running on reduced budgets — Manager Lasso said "non-essential departments get ZERO budget" and eliminated 4 completely, leaving 6 healthier departments. Lasso's L1 penalty creates diamond-shaped constraints with corners on the axes, and the optimal solution often lands exactly on a corner, forcing coefficients to be exactly zero and automatically selecting only the features that truly matter.
What's Next?
Now that you understand both Ridge and Lasso, you're ready for:
- Elastic Net — Combines Ridge + Lasso (best of both worlds!)
- Regularization Path Analysis — Understanding the full coefficient trajectory
- Stability Selection — More robust feature selection
- Group Lasso — When features come in natural groups
Follow me for the next article in this series!
Let's Connect!
If "features getting fired" finally made Lasso click, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's the most features Lasso eliminated for you? I once went from 500 features to 12. The stakeholders were thrilled to finally understand the model! 🎯
The difference between "all 100 features contribute a little" and "only 8 features actually matter"? Lasso regression. Sometimes brutal honesty — firing the useless features — is exactly what your model needs.
Share this with someone drowning in features. Lasso might be the ruthless manager they need.
Happy feature selecting! ✂️