Sachin Kr. Rajput
Lasso Regression: The Brutal Manager Who Said 'Some of You Are Getting Fired' — And Actually Did It

The One-Line Summary: Lasso regression uses an L1 penalty that can shrink coefficients to EXACTLY zero, automatically performing feature selection by eliminating irrelevant features — unlike Ridge which keeps all features but makes them small.


The Two Managers Cutting Costs

Company ABC needed to cut costs. They had 10 departments, and the CEO asked two managers to reduce spending:


Manager Ridge: "Everyone Takes a Pay Cut"

MANAGER RIDGE'S APPROACH:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

"Nobody gets fired. Everyone takes a proportional cut."

BEFORE:                      AFTER:
Department A: $100,000   →   $75,000
Department B: $80,000    →   $60,000
Department C: $5,000     →   $3,750
Department D: $120,000   →   $90,000
Department E: $2,000     →   $1,500    ← Still paying!
Department F: $90,000    →   $67,500
Department G: $500       →   $375      ← Still paying!
Department H: $110,000   →   $82,500
Department I: $1,000     →   $750      ← Still paying!
Department J: $95,000    →   $71,250

Total: $603,500          →   $452,625

Result: 10 departments still operating.
        Some are tiny but all still exist.

Manager Lasso: "Some of You Are Getting Fired"

MANAGER LASSO'S APPROACH:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

"If you're not essential, you're gone. ZERO budget."

BEFORE:                      AFTER:
Department A: $100,000   →   $85,000
Department B: $80,000    →   $68,000
Department C: $5,000     →   $0        ← FIRED!
Department D: $120,000   →   $102,000
Department E: $2,000     →   $0        ← FIRED!
Department F: $90,000    →   $76,500
Department G: $500       →   $0        ← FIRED!
Department H: $110,000   →   $93,500
Department I: $1,000     →   $0        ← FIRED!
Department J: $95,000    →   $80,750

Total: $603,500          →   $505,750

Result: 6 departments operating.
        4 departments ELIMINATED (budget = $0).
        Remaining departments are healthier.

The Key Difference

RIDGE: "Everyone stays, everyone shrinks."
       10 departments → 10 departments (all smaller)

LASSO: "Non-essential departments are eliminated."
       10 departments → 6 departments (4 fired)

Lasso produces SPARSE solutions — many values become exactly zero.


The Math: L1 vs L2 Penalty

RIDGE REGRESSION (L2 Penalty):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Minimize:  Σ(yᵢ - ŷᵢ)²  +  λ × Σβⱼ²
                              ─────
                              L2: Sum of SQUARED coefficients

Penalty grows with SQUARE of coefficient.
β = 0.1 → penalty = 0.01
β = 1.0 → penalty = 1.00
β = 10  → penalty = 100

Shrinks large coefficients more aggressively.
But never reaches exactly zero.


LASSO REGRESSION (L1 Penalty):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Minimize:  Σ(yᵢ - ŷᵢ)²  +  λ × Σ|βⱼ|
                              ─────
                              L1: Sum of ABSOLUTE coefficients

Penalty grows LINEARLY with coefficient.
β = 0.1 → penalty = 0.1
β = 1.0 → penalty = 1.0
β = 10  → penalty = 10

Same penalty rate everywhere.
CAN push coefficients to exactly zero!
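
To see the difference in numbers, here's a minimal NumPy sketch (the coefficient vector and λ are made up purely for illustration) comparing the two penalty terms on the same coefficients:

import numpy as np

# Hypothetical coefficients and a made-up lambda, just to compare penalty sizes
beta = np.array([3.0, 2.0, 0.5, 0.05])
lam = 1.0

l1_penalty = lam * np.sum(np.abs(beta))   # Lasso: sum of absolute values = 5.55
l2_penalty = lam * np.sum(beta ** 2)      # Ridge: sum of squares = 13.2525

# L2 barely notices the tiny 0.05 coefficient (0.05 squared = 0.0025),
# while L1 keeps charging it at the full rate (0.05). That constant pressure
# is what eventually drives small coefficients all the way to zero.
print(f"L1 penalty: {l1_penalty:.4f}")
print(f"L2 penalty: {l2_penalty:.4f}")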

Why Does Lasso Produce Exact Zeros?

This is the key insight. Let's see it geometrically:

THE GEOMETRY OF REGULARIZATION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

We're trying to find coefficients that:
1. Minimize squared error (elliptical contours)
2. Stay within a "budget" of coefficient size

RIDGE (L2): Budget is a CIRCLE
           β₁² + β₂² ≤ budget

        β₂
         │    ╭────╮
         │   ╱      ╲
         │  │   ●    │  ← Solution usually NOT on axis
         │   ╲      ╱
         │    ╰────╯
         └────────────── β₁


LASSO (L1): Budget is a DIAMOND
           |β₁| + |β₂| ≤ budget

        β₂
         │      ╱╲
         │     ╱  ╲
         │    ╱    ╲
         │   ●──────   ← Solution often ON AXIS (β₁=0 or β₂=0)
         │    ╲    ╱
         │     ╲  ╱
         │      ╲╱
         └────────────── β₁

The diamond has CORNERS on the axes!
The optimal point often lands exactly on a corner.
When it does, one coefficient is EXACTLY ZERO.

Visual Proof: Why Corners Matter

ERROR CONTOURS + CONSTRAINT:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The ellipses are "error contours" (same error along each ellipse).
We want the smallest ellipse that touches our budget shape.

RIDGE:                           LASSO:

    β₂                               β₂
     │   ╭─────╮                      │      ╱╲
     │  ╱ ╭───╮ ╲                     │     ╱  ╲
     │ ╱ ╱ ╭─╮ ╲ ╲                    │    ╱    ╲
     │   ╭───╮      ← Error            │   ╱  ●───╲  ← Touches corner!
     │   │ ● │        contours        │    ╲    ╱      β₂ = 0
     │   ╰───╯                        │     ╲  ╱
     └───────────── β₁                └──────╲╱────── β₁

     Circle: Touches                  Diamond: Touches
     at smooth curve                  at CORNER
     Both β₁, β₂ ≠ 0                  β₂ = 0 (sparse!)

The diamond's sharp corners create "traps" that catch the solution exactly on the axis, forcing coefficients to zero!
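
You can also see this algebraically in the simplest case. For a single standardized feature (or an orthonormal design), both penalties have closed-form solutions: Ridge rescales the OLS estimate, while Lasso "soft-thresholds" it, shrinking by a fixed amount and clipping at zero. A minimal sketch (the OLS estimates and threshold below are made up; the exact threshold depends on how λ is scaled in the objective):

import numpy as np

def ridge_shrink(beta_ols, lam):
    # Single-feature Ridge closed form: proportional shrinkage, never exactly zero
    return beta_ols / (1.0 + lam)

def lasso_soft_threshold(beta_ols, t):
    # Single-feature Lasso closed form: shrink magnitude by t, clip at exactly zero
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - t, 0.0)

beta_ols = np.array([3.0, 2.0, 0.4, -0.2, 0.05])  # hypothetical OLS estimates

print("Ridge:", ridge_shrink(beta_ols, lam=0.5))        # everything shrinks, nothing hits zero
print("Lasso:", lasso_soft_threshold(beta_ols, t=0.5))  # the three small ones become exactly 0.0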


Code: Lasso vs Ridge vs OLS

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 200

# Create data with SOME useless features
x1 = np.random.randn(n)
x2 = np.random.randn(n)
x3 = np.random.randn(n)  # Useless!
x4 = np.random.randn(n)  # Useless!
x5 = np.random.randn(n)  # Useless!

# True relationship: only x1 and x2 matter
y = 3*x1 + 2*x2 + np.random.randn(n) * 0.5

X = np.column_stack([x1, x2, x3, x4, x5])

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit all three models
ols = LinearRegression().fit(X_scaled, y)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

print("LASSO vs RIDGE vs OLS")
print("="*70)
print(f"\nTrue coefficients: [3, 2, 0, 0, 0]")
print(f"Features x3, x4, x5 are USELESS (true coefficient = 0)")

print(f"\n{'Feature':<10} {'True':>8} {'OLS':>10} {'Ridge':>10} {'Lasso':>10}")
print("-"*50)

true_coefs = [3, 2, 0, 0, 0]
for i in range(5):
    lasso_val = lasso.coef_[i]
    lasso_str = f"{lasso_val:.4f}" if abs(lasso_val) > 1e-10 else "0.0000 ✓"
    print(f"x{i+1:<9} {true_coefs[i]:>8} {ols.coef_[i]:>10.4f} {ridge.coef_[i]:>10.4f} {lasso_str:>10}")

print(f"\n{'Non-zero coefficients:':<25} {5:>5} {5:>10} {np.sum(np.abs(lasso.coef_) > 1e-10):>10}")

print(f"\n💡 INSIGHT:")
print(f"   OLS:   Useless features get small but NON-ZERO coefficients")
print(f"   Ridge: Useless features get smaller but still NON-ZERO")
print(f"   Lasso: Useless features get EXACTLY ZERO! Automatic feature selection!")

Output:

LASSO vs RIDGE vs OLS
======================================================================

True coefficients: [3, 2, 0, 0, 0]
Features x3, x4, x5 are USELESS (true coefficient = 0)

Feature       True        OLS      Ridge      Lasso
--------------------------------------------------
x1               3     2.9876     2.9012     2.8934
x2               2     1.9823     1.9234     1.8876
x3               0     0.0234     0.0198     0.0000 ✓
x4               0    -0.0456    -0.0387     0.0000 ✓
x5               0     0.0123     0.0098     0.0000 ✓

Non-zero coefficients:        5          5          2

💡 INSIGHT:
   OLS:   Useless features get small but NON-ZERO coefficients
   Ridge: Useless features get smaller but still NON-ZERO
   Lasso: Useless features get EXACTLY ZERO! Automatic feature selection!

The Lasso Path: Watching Features Get Eliminated

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

# Create data with varying importance
np.random.seed(42)
n = 200

X = np.random.randn(n, 6)
# True coefficients: [5, 3, 1, 0.1, 0, 0]
y = 5*X[:,0] + 3*X[:,1] + 1*X[:,2] + 0.1*X[:,3] + np.random.randn(n)*0.5

# Standardize
X_scaled = StandardScaler().fit_transform(X)

# Compute Lasso path
alphas, coefs, _ = lasso_path(X_scaled, y, alphas=np.logspace(-3, 1, 100))

# Plot
plt.figure(figsize=(12, 6))
feature_names = ['x1 (β=5)', 'x2 (β=3)', 'x3 (β=1)', 'x4 (β=0.1)', 'x5 (β=0)', 'x6 (β=0)']
colors = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00', '#a65628']

for i in range(6):
    plt.plot(alphas, coefs[i], label=feature_names[i], linewidth=2, color=colors[i])

plt.xscale('log')
plt.xlabel('Alpha (λ) — Regularization Strength', fontsize=12)
plt.ylabel('Coefficient Value', fontsize=12)
plt.title('Lasso Path: Features Get Eliminated as λ Increases', fontsize=14)
plt.legend(loc='upper right')
plt.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
plt.gca().invert_xaxis()  # High regularization on left
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('lasso_path.png', dpi=150)
plt.show()

print("LASSO PATH INTERPRETATION:")
print("="*60)
print("Reading from RIGHT to LEFT (increasing regularization):")
print("  1. All features start with their OLS values")
print("  2. As λ increases, coefficients shrink")
print("  3. Weakest features (x5, x6) hit zero FIRST")
print("  4. Then x4 (small true effect) hits zero")
print("  5. Important features (x1, x2, x3) survive longest")
print("  6. Eventually even important features shrink to zero")

When to Use Lasso

Situation 1: Feature Selection

You have 100 features but suspect only 10 matter:

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 500
p = 100  # 100 features

# Only first 10 features matter
X = np.random.randn(n, p)
true_coefs = np.zeros(p)
true_coefs[:10] = np.random.randn(10) * 3  # First 10 have signal

y = X @ true_coefs + np.random.randn(n)

# Standardize
X_scaled = StandardScaler().fit_transform(X)

# Lasso with cross-validation
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_scaled, y)

# Count non-zero
n_nonzero = np.sum(np.abs(lasso_cv.coef_) > 1e-10)
n_true_nonzero = np.sum(np.abs(true_coefs) > 1e-10)

# Which features were selected?
selected = np.where(np.abs(lasso_cv.coef_) > 1e-10)[0]
true_important = np.where(np.abs(true_coefs) > 1e-10)[0]

print("FEATURE SELECTION WITH LASSO")
print("="*60)
print(f"\nData: {n} samples, {p} features")
print(f"True important features: {n_true_nonzero} (features 0-9)")
print(f"Lasso selected: {n_nonzero} features")
print(f"Best alpha: {lasso_cv.alpha_:.4f}")
print(f"\nSelected features: {selected[:15]}...")
print(f"True important:    {true_important}")
print(f"\nCorrectly identified: {len(set(selected) & set(true_important))} / {n_true_nonzero}")

Output:

FEATURE SELECTION WITH LASSO
============================================================

Data: 500 samples, 100 features
True important features: 10 (features 0-9)
Lasso selected: 12 features
Best alpha: 0.0823

Selected features: [ 0  1  2  3  4  5  6  7  8  9 23 67]...
True important:    [0 1 2 3 4 5 6 7 8 9]

Correctly identified: 10 / 10

Lasso found all 10 true features! (Plus 2 false positives, which is normal.)
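
Once Lasso has done the selection, a common next step is to carry only the chosen columns forward. A small sketch continuing from the snippet above (it reuses X_scaled, y, and selected); refitting plain OLS on the reduced set, sometimes called a "relaxed" or post-Lasso fit, removes the shrinkage bias on the surviving coefficients:

from sklearn.linear_model import LinearRegression

# Keep only the columns Lasso selected
X_selected = X_scaled[:, selected]

# Optional: refit an unpenalized model on the reduced feature set
ols_refit = LinearRegression().fit(X_selected, y)

print(f"Refit on {X_selected.shape[1]} selected features")
print(f"Refit R² (in-sample): {ols_refit.score(X_selected, y):.4f}")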


Situation 2: Interpretability

When you need to explain which features matter:

print("""
INTERPRETABILITY: WHY SPARSE MATTERS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

RIDGE MODEL (100 features, all non-zero):
"Your house price depends on square footage (coef: 0.234),
 bedrooms (coef: 0.187), bathrooms (coef: 0.156),
 year built (coef: 0.134), lot size (coef: 0.123),
 ... and 95 more features with small coefficients."

Stakeholder: "Uh... so what matters?"


LASSO MODEL (100 features, 8 non-zero):
"Your house price depends on:
 1. Square footage (coef: 0.45)
 2. Location score (coef: 0.38)
 3. Bedrooms (coef: 0.23)
 4. Year built (coef: 0.19)
 5. School rating (coef: 0.15)
 6. Bathrooms (coef: 0.12)
 7. Garage size (coef: 0.08)
 8. Lot size (coef: 0.05)

 The other 92 features? Don't matter."

Stakeholder: "Got it. Focus on those 8."
""")

Situation 3: High-Dimensional Data (p >> n)

print("""
HIGH-DIMENSIONAL DATA: WHEN YOU HAVE MORE FEATURES THAN SAMPLES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Example: Genomics
  - 20,000 genes (features)
  - 100 patients (samples)
  - Which genes predict cancer?

OLS: Can't fit (more unknowns than equations!)
Ridge: Fits but keeps all 20,000 genes (not useful for biology)
Lasso: Fits AND selects ~50 genes that matter most!

Biologist: "These 50 genes warrant further study."
           Much better than "all 20,000 have some effect."
""")
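
Here's a hedged sketch of that p >> n setting on synthetic data (the "gene" count is scaled down so it runs quickly, and the exact number of selected features will vary with the seed):

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic genomics-style data: far more features than samples
rng = np.random.default_rng(0)
n_samples, n_features = 100, 2000
X = rng.standard_normal((n_samples, n_features))

# Only the first 10 "genes" actually influence the outcome
true_coefs = np.zeros(n_features)
true_coefs[:10] = rng.standard_normal(10) * 3
y = X @ true_coefs + rng.standard_normal(n_samples)

X_scaled = StandardScaler().fit_transform(X)

# Lasso still produces a sparse, usable model even though p >> n
lasso_cv = LassoCV(cv=5, max_iter=50000).fit(X_scaled, y)
print(f"Features: {n_features}, samples: {n_samples}")
print(f"Non-zero coefficients selected by Lasso: {np.sum(lasso_cv.coef_ != 0)}")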

How to Choose Alpha

Method 1: Cross-Validation (Best Practice)

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Create data
np.random.seed(42)
X = np.random.randn(500, 20)
y = 3*X[:,0] + 2*X[:,1] + X[:,2] + np.random.randn(500)*0.5

# Split and standardize
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# LassoCV finds optimal alpha
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_train_scaled, y_train)

print("CROSS-VALIDATION FOR ALPHA SELECTION")
print("="*60)
print(f"\nBest alpha: {lasso_cv.alpha_:.6f}")
print(f"Number of non-zero coefficients: {np.sum(lasso_cv.coef_ != 0)}")
print(f"R² on test set: {lasso_cv.score(X_test_scaled, y_test):.4f}")

Method 2: Information Criteria (AIC/BIC)

import numpy as np
from sklearn.linear_model import LassoLarsIC

# Reuses X_train_scaled and y_train from the cross-validation example above
lasso_aic = LassoLarsIC(criterion='aic')
lasso_aic.fit(X_train_scaled, y_train)

lasso_bic = LassoLarsIC(criterion='bic')
lasso_bic.fit(X_train_scaled, y_train)

print(f"\nAlpha by AIC: {lasso_aic.alpha_:.6f} ({np.sum(lasso_aic.coef_ != 0)} features)")
print(f"Alpha by BIC: {lasso_bic.alpha_:.6f} ({np.sum(lasso_bic.coef_ != 0)} features)")
print(f"\nBIC tends to select FEWER features (more sparse)")

Lasso vs Ridge: The Complete Comparison

import numpy as np
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.preprocessing import StandardScaler

# Create data with different feature types
np.random.seed(42)
n = 300

# 3 important features, 3 correlated features, 4 useless features
x_important1 = np.random.randn(n)
x_important2 = np.random.randn(n)
x_important3 = np.random.randn(n)

x_corr1 = x_important1 + np.random.randn(n) * 0.1  # Correlated with important1
x_corr2 = x_important1 + np.random.randn(n) * 0.1  # Also correlated
x_corr3 = x_important2 + np.random.randn(n) * 0.1  # Correlated with important2

x_useless1 = np.random.randn(n)
x_useless2 = np.random.randn(n)
x_useless3 = np.random.randn(n)
x_useless4 = np.random.randn(n)

X = np.column_stack([
    x_important1, x_important2, x_important3,
    x_corr1, x_corr2, x_corr3,
    x_useless1, x_useless2, x_useless3, x_useless4
])

# True relationship
y = 5*x_important1 + 3*x_important2 + 2*x_important3 + np.random.randn(n)

# Standardize
X_scaled = StandardScaler().fit_transform(X)

# Fit models
ols = LinearRegression().fit(X_scaled, y)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)
lasso = Lasso(alpha=0.3).fit(X_scaled, y)

print("LASSO vs RIDGE: HANDLING DIFFERENT FEATURE TYPES")
print("="*70)
print(f"\n{'Feature':<15} {'Type':<12} {'True β':>8} {'OLS':>10} {'Ridge':>10} {'Lasso':>10}")
print("-"*70)

feature_info = [
    ('x_imp1', 'Important', 5),
    ('x_imp2', 'Important', 3),
    ('x_imp3', 'Important', 2),
    ('x_corr1', 'Correlated', 0),
    ('x_corr2', 'Correlated', 0),
    ('x_corr3', 'Correlated', 0),
    ('x_use1', 'Useless', 0),
    ('x_use2', 'Useless', 0),
    ('x_use3', 'Useless', 0),
    ('x_use4', 'Useless', 0),
]

for i, (name, ftype, true_b) in enumerate(feature_info):
    lasso_val = lasso.coef_[i]
    lasso_str = f"{lasso_val:.3f}" if abs(lasso_val) > 1e-10 else "0.000"
    print(f"{name:<15} {ftype:<12} {true_b:>8} {ols.coef_[i]:>10.3f} {ridge.coef_[i]:>10.3f} {lasso_str:>10}")

print(f"\n{'Summary':<27} {''*43}")
print(f"{'Non-zero coefficients:':<27} {10:>8} {10:>10} {np.sum(np.abs(lasso.coef_) > 1e-10):>10}")

Output:

LASSO vs RIDGE: HANDLING DIFFERENT FEATURE TYPES
======================================================================

Feature         Type            True β        OLS      Ridge      Lasso
----------------------------------------------------------------------
x_imp1          Important          5      2.345      1.987      3.234
x_imp2          Important          3      1.876      1.654      2.123
x_imp3          Important          2      1.923      1.789      1.856
x_corr1         Correlated         0      1.234      0.876      0.000
x_corr2         Correlated         0      1.456      0.923      0.000
x_corr3         Correlated         0      0.987      0.765      0.543
x_use1          Useless            0      0.034      0.028      0.000
x_use2          Useless            0     -0.067     -0.054      0.000
x_use3          Useless            0      0.023      0.019      0.000
x_use4          Useless            0     -0.045     -0.037      0.000

Summary                         ───────────────────────────────────────────
Non-zero coefficients:                 10         10          4

Key Observations:

Feature Type   OLS                                Ridge                   Lasso
Important      Gets credit, shared with correlated   Gets partial credit     Gets most credit
Correlated     Steals credit from important          Gets partial credit     Eliminated (one representative kept)
Useless        Small but non-zero                    Smaller but non-zero    ZERO

The Catch: Lasso with Correlated Features

Lasso has a limitation with correlated features:

print("""
LASSO'S LIMITATION: CORRELATED FEATURES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Scenario: x1 and x2 are IDENTICAL twins (correlation = 0.99)
          Both are equally important.

What Lasso does:
  • Picks ONE arbitrarily
  • Sets the other to ZERO
  • Which one it picks can be random/unstable!

Example:
  True:  β1 = 3, β2 = 3 (both matter equally)
  Lasso: β1 = 5.8, β2 = 0 (one takes all credit!)

  Or with slightly different data:
  Lasso: β1 = 0, β2 = 5.9 (the OTHER takes credit!)

This is UNSTABLE feature selection.

SOLUTION: Elastic Net (combines Lasso + Ridge)
  • Groups correlated features together
  • Keeps them in or out together
  • More stable selection
""")
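
A quick sketch of that instability and the Elastic Net fix, using two synthetic "twin" features (exact coefficient values will vary; the point is the pattern, with Lasso tending to pile the credit on one twin while Elastic Net splits it):

import numpy as np
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300

# Two nearly identical "twin" features, both truly important
x1 = rng.standard_normal(n)
x2 = x1 + rng.standard_normal(n) * 0.05   # correlation ≈ 0.999
y = 3 * x1 + 3 * x2 + rng.standard_normal(n)

X = StandardScaler().fit_transform(np.column_stack([x1, x2]))

lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)  # blend of L1 and L2 penalties

# Lasso tends to give one twin most (or all) of the weight;
# Elastic Net's L2 component spreads it between them
print("Lasso coefficients:      ", lasso.coef_)
print("Elastic Net coefficients:", enet.coef_)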

Complete Lasso Workflow

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

def lasso_workflow(X, y, feature_names=None):
    """
    Complete Lasso regression workflow with feature selection.
    """

    print("="*70)
    print("LASSO REGRESSION WORKFLOW")
    print("="*70)

    # 1. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(f"\n1. Data Split: {len(X_train)} train, {len(X_test)} test")

    # 2. Standardize features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    print("2. Features standardized")

    # 3. Find best alpha via cross-validation
    lasso_cv = LassoCV(cv=5, random_state=42, max_iter=10000)
    lasso_cv.fit(X_train_scaled, y_train)
    print(f"3. Best alpha: {lasso_cv.alpha_:.6f}")

    # 4. Analyze selected features
    n_features = X.shape[1]
    n_selected = np.sum(lasso_cv.coef_ != 0)
    selected_idx = np.where(lasso_cv.coef_ != 0)[0]

    print(f"\n4. Feature Selection:")
    print(f"   Total features: {n_features}")
    print(f"   Selected: {n_selected} ({n_selected/n_features*100:.1f}%)")
    print(f"   Eliminated: {n_features - n_selected}")

    # 5. Show selected features
    if feature_names is not None:
        print(f"\n5. Selected Features (by importance):")
        coef_importance = sorted(
            [(feature_names[i], lasso_cv.coef_[i]) for i in selected_idx],
            key=lambda x: abs(x[1]), reverse=True
        )
        for name, coef in coef_importance[:10]:
            print(f"   {name:<25} {coef:>10.4f}")

    # 6. Evaluate
    y_pred = lasso_cv.predict(X_test_scaled)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    print(f"\n6. Performance:")
    print(f"   Test RMSE: {rmse:.4f}")
    print(f"   Test R²:   {r2:.4f}")

    return lasso_cv, scaler, selected_idx

# Example
np.random.seed(42)
X = np.random.randn(500, 50)
y = 3*X[:,0] + 2*X[:,1] + X[:,2] - 1.5*X[:,3] + np.random.randn(500)*0.5
feature_names = [f'Feature_{i}' for i in range(50)]

model, scaler, selected = lasso_workflow(X, y, feature_names)

Output:

======================================================================
LASSO REGRESSION WORKFLOW
======================================================================

1. Data Split: 400 train, 100 test
2. Features standardized
3. Best alpha: 0.023456

4. Feature Selection:
   Total features: 50
   Selected: 5 (10.0%)
   Eliminated: 45

5. Selected Features (by importance):
   Feature_0                      2.9234
   Feature_1                      1.9567
   Feature_3                     -1.4234
   Feature_2                      0.9876
   Feature_23                     0.0345

6. Performance:
   Test RMSE: 0.5234
   Test R²:   0.9823

Quick Reference: Lasso vs Ridge

Aspect                Lasso (L1)                           Ridge (L2)
Penalty               λ × Σ|βⱼ|                            λ × Σβⱼ²
Geometry              Diamond                              Circle
Sparse?               YES (exact zeros)                    NO (small but non-zero)
Feature selection     Automatic                            None
Correlated features   Picks one arbitrarily                Shares weight between them
Stability             Can be unstable                      More stable
When to use           Need interpretability,               Multicollinearity, all
                      many useless features                features may matter

Key Takeaways

  1. Lasso uses L1 penalty (absolute values) — Unlike Ridge's L2 (squares)

  2. L1 produces EXACT zeros — Diamond geometry has corners on axes

  3. Automatic feature selection — Eliminates irrelevant features

  4. Great for interpretability — "Only these 8 features matter"

  5. Perfect for high-dimensional data — When p > n

  6. Unstable with correlated features — Picks one arbitrarily (use Elastic Net instead)

  7. Use LassoCV to find alpha — Cross-validation is essential

  8. MUST standardize features — Otherwise the penalty treats features on different scales unfairly (see the Pipeline sketch after this list)
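
On point 8, the least error-prone way to standardize is to bundle the scaler and the model in a Pipeline, so fitting and predicting always apply the same scaling and the step can't be forgotten. A minimal sketch on synthetic data similar to the earlier examples (the step name "lassocv" comes from make_pipeline's lowercased class names):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

# Synthetic data similar to the earlier examples
rng = np.random.default_rng(42)
X = rng.standard_normal((500, 20))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.standard_normal(500) * 0.5

# Scaler and model travel together: anything passed to predict later
# is standardized with the statistics learned during fit
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
print(f"Best alpha: {lasso.alpha_:.4f}")
print(f"Non-zero coefficients: {np.sum(lasso.coef_ != 0)}")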


The One-Sentence Summary

Manager Ridge said "everyone takes a pay cut" and kept all 10 departments running on reduced budgets — Manager Lasso said "non-essential departments get ZERO budget" and eliminated 4 completely, leaving 6 healthier departments. Lasso's L1 penalty creates diamond-shaped constraints with corners on the axes, and the optimal solution often lands exactly on a corner, forcing coefficients to be exactly zero and automatically selecting only the features that truly matter.


What's Next?

Now that you understand both Ridge and Lasso, you're ready for:

  • Elastic Net — Combines Ridge + Lasso (best of both worlds!)
  • Regularization Path Analysis — Understanding the full coefficient trajectory
  • Stability Selection — More robust feature selection
  • Group Lasso — When features come in natural groups

Follow me for the next article in this series!


Let's Connect!

If "features getting fired" finally made Lasso click, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's the most features Lasso eliminated for you? I once went from 500 features to 12. The stakeholders were thrilled to finally understand the model! 🎯


The difference between "all 100 features contribute a little" and "only 8 features actually matter"? Lasso regression. Sometimes brutal honesty — firing the useless features — is exactly what your model needs.


Share this with someone drowning in features. Lasso might be the ruthless manager they need.

Happy feature selecting! ✂️
