The One-Line Summary: Lasso regression uses an L1 penalty that can shrink coefficients to EXACTLY zero, automatically performing feature selection by eliminating irrelevant features — unlike Ridge which keeps all features but makes them small.
The Two Managers Cutting Costs
Company ABC needed to cut costs. They had 10 departments, and the CEO asked two managers to reduce spending:
Manager Ridge: "Everyone Takes a Pay Cut"
MANAGER RIDGE'S APPROACH:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"Nobody gets fired. Everyone takes a proportional cut."
BEFORE: AFTER:
Department A: $100,000 → $75,000
Department B: $80,000 → $60,000
Department C: $5,000 → $3,750
Department D: $120,000 → $90,000
Department E: $2,000 → $1,500 ← Still paying!
Department F: $90,000 → $67,500
Department G: $500 → $375 ← Still paying!
Department H: $110,000 → $82,500
Department I: $1,000 → $750 ← Still paying!
Department J: $95,000 → $71,250
Total: $603,500 → $452,625
Result: 10 departments still operating.
Some are tiny but all still exist.
Manager Lasso: "Some of You Are Getting Fired"
MANAGER LASSO'S APPROACH:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"If you're not essential, you're gone. ZERO budget."
BEFORE: AFTER:
Department A: $100,000 → $85,000
Department B: $80,000 → $68,000
Department C: $5,000 → $0 ← FIRED!
Department D: $120,000 → $102,000
Department E: $2,000 → $0 ← FIRED!
Department F: $90,000 → $76,500
Department G: $500 → $0 ← FIRED!
Department H: $110,000 → $93,500
Department I: $1,000 → $0 ← FIRED!
Department J: $95,000 → $80,750
Total: $603,500 → $505,750
Result: 6 departments operating.
4 departments ELIMINATED (budget = $0).
Remaining departments are healthier.
The Key Difference
RIDGE: "Everyone stays, everyone shrinks."
10 departments → 10 departments (all smaller)
LASSO: "Non-essential departments are eliminated."
10 departments → 6 departments (4 fired)
Lasso produces SPARSE solutions — many values become exactly zero.
The Math: L1 vs L2 Penalty
RIDGE REGRESSION (L2 Penalty):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Minimize: Σ(yᵢ - ŷᵢ)² + λ × Σβⱼ²
L2 penalty: the sum of SQUARED coefficients
Penalty grows with SQUARE of coefficient.
β = 0.1 → penalty = 0.01
β = 1.0 → penalty = 1.00
β = 10 → penalty = 100
Shrinks large coefficients aggressively,
but the squared penalty's slope fades to nothing near zero,
so no coefficient is ever pushed all the way to exactly zero.
LASSO REGRESSION (L1 Penalty):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Minimize: Σ(yᵢ - ŷᵢ)² + λ × Σ|βⱼ|
L1 penalty: the sum of ABSOLUTE coefficients
Penalty grows LINEARLY with coefficient.
β = 0.1 → penalty = 0.1
β = 1.0 → penalty = 1.0
β = 10 → penalty = 10
The penalty's slope is the same constant λ everywhere — even right next to zero —
so it CAN push coefficients to exactly zero!
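To make that difference concrete, here is a tiny plain-NumPy sketch (illustrative numbers only) that prints each penalty and its slope for a few coefficient sizes — the L2 slope fades toward zero as β shrinks, while the L1 slope stays fixed at λ, which is the constant pull that can park a coefficient at exactly zero.
import numpy as np
lam = 1.0
for b in [10.0, 1.0, 0.1, 0.01]:
    l2_pen, l2_slope = lam * b**2, 2 * lam * b         # slope of λβ² vanishes as β → 0
    l1_pen, l1_slope = lam * abs(b), lam * np.sign(b)  # slope of λ|β| stays at λ
    print(f"β={b:>6}:  L2 penalty={l2_pen:8.4f} (slope {l2_slope:7.3f})"
          f"  |  L1 penalty={l1_pen:6.2f} (slope {l1_slope:4.1f})")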
Why Does Lasso Produce Exact Zeros?
This is the key insight. Let's see it geometrically:
THE GEOMETRY OF REGULARIZATION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
We're trying to find coefficients that:
1. Minimize squared error (elliptical contours)
2. Stay within a "budget" of coefficient size
RIDGE (L2): Budget is a CIRCLE
β₁² + β₂² ≤ budget
β₂
│ ╭────╮
│ ╱ ╲
│ │ ● │ ← Solution usually NOT on axis
│ ╲ ╱
│ ╰────╯
└────────────── β₁
LASSO (L1): Budget is a DIAMOND
|β₁| + |β₂| ≤ budget
β₂
│ ╱╲
│ ╱ ╲
│ ╱ ╲
│ ●────── ← Solution often ON AXIS (β₁=0 or β₂=0)
│ ╲ ╱
│ ╲ ╱
│ ╲╱
└────────────── β₁
The diamond has CORNERS on the axes!
The optimal point often lands exactly on a corner.
When it does, one coefficient is EXACTLY ZERO.
Visual Proof: Why Corners Matter
ERROR CONTOURS + CONSTRAINT:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The ellipses are "error contours" (same error along each ellipse).
We want the smallest ellipse that touches our budget shape.
RIDGE (circular budget): the smallest error ellipse that reaches the budget touches the circle along its smooth edge, away from the axes — both β₁ and β₂ stay non-zero.
LASSO (diamond budget): the smallest error ellipse tends to touch the diamond at a CORNER sitting on an axis — there β₂ = 0, a sparse solution!
The diamond's sharp corners create "traps" that catch the solution exactly on the axis, forcing coefficients to zero!
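The same story shows up in algebra. For a single standardized feature, both penalized problems have well-known closed-form solutions: Ridge rescales the OLS coefficient by a constant factor (never exactly zero), while Lasso applies soft-thresholding — subtract λ from the magnitude and clip at zero. A minimal sketch of those one-feature formulas (the exact shrinkage constants depend on how the loss is normalized, but the shapes don't):
import numpy as np

def ridge_1d(beta_ols, lam):
    # One-feature Ridge: proportional shrinkage — smaller, but never exactly zero
    return beta_ols / (1 + lam)

def lasso_1d(beta_ols, lam):
    # One-feature Lasso: soft-thresholding — exactly zero once |β_OLS| <= λ
    return np.sign(beta_ols) * max(abs(beta_ols) - lam, 0.0)

lam = 1.0
for beta_ols in [3.0, 1.5, 0.8, 0.3]:
    print(f"OLS β={beta_ols:>4}  →  Ridge: {ridge_1d(beta_ols, lam):6.3f}   "
          f"Lasso: {lasso_1d(beta_ols, lam):6.3f}")
With λ = 1, any OLS coefficient smaller than 1 in magnitude is soft-thresholded all the way to 0.000 — exactly the "landing on a corner" behavior in the picture above.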
Code: Lasso vs Ridge vs OLS
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
n = 200
# Create data with SOME useless features
x1 = np.random.randn(n)
x2 = np.random.randn(n)
x3 = np.random.randn(n) # Useless!
x4 = np.random.randn(n) # Useless!
x5 = np.random.randn(n) # Useless!
# True relationship: only x1 and x2 matter
y = 3*x1 + 2*x2 + np.random.randn(n) * 0.5
X = np.column_stack([x1, x2, x3, x4, x5])
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit all three models
ols = LinearRegression().fit(X_scaled, y)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)
print("LASSO vs RIDGE vs OLS")
print("="*70)
print(f"\nTrue coefficients: [3, 2, 0, 0, 0]")
print(f"Features x3, x4, x5 are USELESS (true coefficient = 0)")
print(f"\n{'Feature':<10} {'True':>8} {'OLS':>10} {'Ridge':>10} {'Lasso':>10}")
print("-"*50)
true_coefs = [3, 2, 0, 0, 0]
for i in range(5):
    lasso_val = lasso.coef_[i]
    lasso_str = f"{lasso_val:.4f}" if abs(lasso_val) > 1e-10 else "0.0000 ✓"
    print(f"x{i+1:<9} {true_coefs[i]:>8} {ols.coef_[i]:>10.4f} {ridge.coef_[i]:>10.4f} {lasso_str:>10}")
print(f"\n{'Non-zero coefficients:':<25} {5:>5} {5:>10} {np.sum(np.abs(lasso.coef_) > 1e-10):>10}")
print(f"\n💡 INSIGHT:")
print(f" OLS: Useless features get small but NON-ZERO coefficients")
print(f" Ridge: Useless features get smaller but still NON-ZERO")
print(f" Lasso: Useless features get EXACTLY ZERO! Automatic feature selection!")
Output:
LASSO vs RIDGE vs OLS
======================================================================
True coefficients: [3, 2, 0, 0, 0]
Features x3, x4, x5 are USELESS (true coefficient = 0)
Feature True OLS Ridge Lasso
--------------------------------------------------
x1 3 2.9876 2.9012 2.8934
x2 2 1.9823 1.9234 1.8876
x3 0 0.0234 0.0198 0.0000 ✓
x4 0 -0.0456 -0.0387 0.0000 ✓
x5 0 0.0123 0.0098 0.0000 ✓
Non-zero coefficients: 5 5 2
💡 INSIGHT:
OLS: Useless features get small but NON-ZERO coefficients
Ridge: Useless features get smaller but still NON-ZERO
Lasso: Useless features get EXACTLY ZERO! Automatic feature selection!
The Lasso Path: Watching Features Get Eliminated
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler
# Create data with varying importance
np.random.seed(42)
n = 200
X = np.random.randn(n, 6)
# True coefficients: [5, 3, 1, 0.1, 0, 0]
y = 5*X[:,0] + 3*X[:,1] + 1*X[:,2] + 0.1*X[:,3] + np.random.randn(n)*0.5
# Standardize
X_scaled = StandardScaler().fit_transform(X)
# Compute Lasso path
alphas, coefs, _ = lasso_path(X_scaled, y, alphas=np.logspace(-3, 1, 100))
# Plot
plt.figure(figsize=(12, 6))
feature_names = ['x1 (β=5)', 'x2 (β=3)', 'x3 (β=1)', 'x4 (β=0.1)', 'x5 (β=0)', 'x6 (β=0)']
colors = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00', '#a65628']
for i in range(6):
    plt.plot(alphas, coefs[i], label=feature_names[i], linewidth=2, color=colors[i])
plt.xscale('log')
plt.xlabel('Alpha (λ) — Regularization Strength', fontsize=12)
plt.ylabel('Coefficient Value', fontsize=12)
plt.title('Lasso Path: Features Get Eliminated as λ Increases', fontsize=14)
plt.legend(loc='upper right')
plt.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
plt.gca().invert_xaxis() # High regularization on left
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('lasso_path.png', dpi=150)
plt.show()
print("LASSO PATH INTERPRETATION:")
print("="*60)
print("Reading from RIGHT to LEFT (increasing regularization):")
print(" 1. All features start with their OLS values")
print(" 2. As λ increases, coefficients shrink")
print(" 3. Weakest features (x5, x6) hit zero FIRST")
print(" 4. Then x4 (small true effect) hits zero")
print(" 5. Important features (x1, x2, x3) survive longest")
print(" 6. Eventually even important features shrink to zero")
When to Use Lasso
Situation 1: Feature Selection
You have 100 features but suspect only 10 matter:
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
n = 500
p = 100 # 100 features
# Only first 10 features matter
X = np.random.randn(n, p)
true_coefs = np.zeros(p)
true_coefs[:10] = np.random.randn(10) * 3 # First 10 have signal
y = X @ true_coefs + np.random.randn(n)
# Standardize
X_scaled = StandardScaler().fit_transform(X)
# Lasso with cross-validation
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_scaled, y)
# Count non-zero
n_nonzero = np.sum(np.abs(lasso_cv.coef_) > 1e-10)
n_true_nonzero = np.sum(np.abs(true_coefs) > 1e-10)
# Which features were selected?
selected = np.where(np.abs(lasso_cv.coef_) > 1e-10)[0]
true_important = np.where(np.abs(true_coefs) > 1e-10)[0]
print("FEATURE SELECTION WITH LASSO")
print("="*60)
print(f"\nData: {n} samples, {p} features")
print(f"True important features: {n_true_nonzero} (features 0-9)")
print(f"Lasso selected: {n_nonzero} features")
print(f"Best alpha: {lasso_cv.alpha_:.4f}")
print(f"\nSelected features: {selected[:15]}...")
print(f"True important: {true_important}")
print(f"\nCorrectly identified: {len(set(selected) & set(true_important))} / {n_true_nonzero}")
Output:
FEATURE SELECTION WITH LASSO
============================================================
Data: 500 samples, 100 features
True important features: 10 (features 0-9)
Lasso selected: 12 features
Best alpha: 0.0823
Selected features: [ 0 1 2 3 4 5 6 7 8 9 23 67]...
True important: [0 1 2 3 4 5 6 7 8 9]
Correctly identified: 10 / 10
Lasso found all 10 true features! (Plus 2 false positives, which is normal.)
Situation 2: Interpretability
When you need to explain which features matter:
print("""
INTERPRETABILITY: WHY SPARSE MATTERS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RIDGE MODEL (100 features, all non-zero):
"Your house price depends on square footage (coef: 0.234),
bedrooms (coef: 0.187), bathrooms (coef: 0.156),
year built (coef: 0.134), lot size (coef: 0.123),
... and 95 more features with small coefficients."
Stakeholder: "Uh... so what matters?"
LASSO MODEL (100 features, 8 non-zero):
"Your house price depends on:
1. Square footage (coef: 0.45)
2. Location score (coef: 0.38)
3. Bedrooms (coef: 0.23)
4. Year built (coef: 0.19)
5. School rating (coef: 0.15)
6. Bathrooms (coef: 0.12)
7. Garage size (coef: 0.08)
8. Lot size (coef: 0.05)
The other 92 features? Don't matter."
Stakeholder: "Got it. Focus on those 8."
""")
Situation 3: High-Dimensional Data (p >> n)
print("""
HIGH-DIMENSIONAL DATA: WHEN YOU HAVE MORE FEATURES THAN SAMPLES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Example: Genomics
- 20,000 genes (features)
- 100 patients (samples)
- Which genes predict cancer?
OLS: Can't fit (more unknowns than equations!)
Ridge: Fits but keeps all 20,000 genes (not useful for biology)
Lasso: Fits AND selects ~50 genes that matter most!
Biologist: "These 50 genes warrant further study."
Much better than "all 20,000 have some effect."
""")
How to Choose Alpha
Method 1: Cross-Validation (Best Practice)
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Create data
np.random.seed(42)
X = np.random.randn(500, 20)
y = 3*X[:,0] + 2*X[:,1] + X[:,2] + np.random.randn(500)*0.5
# Split and standardize
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# LassoCV finds optimal alpha
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_train_scaled, y_train)
print("CROSS-VALIDATION FOR ALPHA SELECTION")
print("="*60)
print(f"\nBest alpha: {lasso_cv.alpha_:.6f}")
print(f"Number of non-zero coefficients: {np.sum(lasso_cv.coef_ != 0)}")
print(f"R² on test set: {lasso_cv.score(X_test_scaled, y_test):.4f}")
Method 2: Information Criteria (AIC/BIC)
from sklearn.linear_model import LassoLarsIC
# Use information criteria
lasso_aic = LassoLarsIC(criterion='aic')
lasso_aic.fit(X_train_scaled, y_train)
lasso_bic = LassoLarsIC(criterion='bic')
lasso_bic.fit(X_train_scaled, y_train)
print(f"\nAlpha by AIC: {lasso_aic.alpha_:.6f} ({np.sum(lasso_aic.coef_ != 0)} features)")
print(f"Alpha by BIC: {lasso_bic.alpha_:.6f} ({np.sum(lasso_bic.coef_ != 0)} features)")
print(f"\nBIC tends to select FEWER features (more sparse)")
Lasso vs Ridge: The Complete Comparison
import numpy as np
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.preprocessing import StandardScaler
# Create data with different feature types
np.random.seed(42)
n = 300
# 3 important features, 3 correlated features, 4 useless features
x_important1 = np.random.randn(n)
x_important2 = np.random.randn(n)
x_important3 = np.random.randn(n)
x_corr1 = x_important1 + np.random.randn(n) * 0.1 # Correlated with important1
x_corr2 = x_important1 + np.random.randn(n) * 0.1 # Also correlated
x_corr3 = x_important2 + np.random.randn(n) * 0.1 # Correlated with important2
x_useless1 = np.random.randn(n)
x_useless2 = np.random.randn(n)
x_useless3 = np.random.randn(n)
x_useless4 = np.random.randn(n)
X = np.column_stack([
    x_important1, x_important2, x_important3,
    x_corr1, x_corr2, x_corr3,
    x_useless1, x_useless2, x_useless3, x_useless4
])
# True relationship
y = 5*x_important1 + 3*x_important2 + 2*x_important3 + np.random.randn(n)
# Standardize
X_scaled = StandardScaler().fit_transform(X)
# Fit models
ols = LinearRegression().fit(X_scaled, y)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)
lasso = Lasso(alpha=0.3).fit(X_scaled, y)
print("LASSO vs RIDGE: HANDLING DIFFERENT FEATURE TYPES")
print("="*70)
print(f"\n{'Feature':<15} {'Type':<12} {'True β':>8} {'OLS':>10} {'Ridge':>10} {'Lasso':>10}")
print("-"*70)
feature_info = [
    ('x_imp1', 'Important', 5),
    ('x_imp2', 'Important', 3),
    ('x_imp3', 'Important', 2),
    ('x_corr1', 'Correlated', 0),
    ('x_corr2', 'Correlated', 0),
    ('x_corr3', 'Correlated', 0),
    ('x_use1', 'Useless', 0),
    ('x_use2', 'Useless', 0),
    ('x_use3', 'Useless', 0),
    ('x_use4', 'Useless', 0),
]
for i, (name, ftype, true_b) in enumerate(feature_info):
    lasso_val = lasso.coef_[i]
    lasso_str = f"{lasso_val:.3f}" if abs(lasso_val) > 1e-10 else "0.000"
    print(f"{name:<15} {ftype:<12} {true_b:>8} {ols.coef_[i]:>10.3f} {ridge.coef_[i]:>10.3f} {lasso_str:>10}")
print(f"\n{'Summary':<27} {'─'*43}")
print(f"{'Non-zero coefficients:':<27} {10:>8} {10:>10} {np.sum(np.abs(lasso.coef_) > 1e-10):>10}")
Output:
LASSO vs RIDGE: HANDLING DIFFERENT FEATURE TYPES
======================================================================
Feature Type True β OLS Ridge Lasso
----------------------------------------------------------------------
x_imp1 Important 5 2.345 1.987 3.234
x_imp2 Important 3 1.876 1.654 2.123
x_imp3 Important 2 1.923 1.789 1.856
x_corr1 Correlated 0 1.234 0.876 0.000
x_corr2 Correlated 0 1.456 0.923 0.000
x_corr3 Correlated 0 0.987 0.765 0.543
x_use1 Useless 0 0.034 0.028 0.000
x_use2 Useless 0 -0.067 -0.054 0.000
x_use3 Useless 0 0.023 0.019 0.000
x_use4 Useless 0 -0.045 -0.037 0.000
Summary ───────────────────────────────────────────
Non-zero coefficients: 10 10 4
Key Observations:
| Feature Type | OLS | Ridge | Lasso |
|---|---|---|---|
| Important | Gets credit but shared with correlated | Gets partial credit | Gets most credit |
| Correlated | Steals credit from important | Gets partial credit | Eliminated (one representative kept) |
| Useless | Small but non-zero | Smaller but non-zero | ZERO |
The Catch: Lasso with Correlated Features
Lasso has a limitation with correlated features:
print("""
LASSO'S LIMITATION: CORRELATED FEATURES
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Scenario: x1 and x2 are IDENTICAL twins (correlation = 0.99)
Both are equally important.
What Lasso does:
• Picks ONE arbitrarily
• Sets the other to ZERO
• Which one it picks can be random/unstable!
Example:
True: β1 = 3, β2 = 3 (both matter equally)
Lasso: β1 = 5.8, β2 = 0 (one takes all credit!)
Or with slightly different data:
Lasso: β1 = 0, β2 = 5.9 (the OTHER takes credit!)
This is UNSTABLE feature selection.
SOLUTION: Elastic Net (combines Lasso + Ridge)
• Groups correlated features together
• Keeps them in or out together
• More stable selection
""")
Complete Lasso Workflow
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
def lasso_workflow(X, y, feature_names=None):
    """
    Complete Lasso regression workflow with feature selection.
    """
    print("="*70)
    print("LASSO REGRESSION WORKFLOW")
    print("="*70)

    # 1. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(f"\n1. Data Split: {len(X_train)} train, {len(X_test)} test")

    # 2. Standardize features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    print("2. Features standardized")

    # 3. Find best alpha via cross-validation
    lasso_cv = LassoCV(cv=5, random_state=42, max_iter=10000)
    lasso_cv.fit(X_train_scaled, y_train)
    print(f"3. Best alpha: {lasso_cv.alpha_:.6f}")

    # 4. Analyze selected features
    n_features = X.shape[1]
    n_selected = np.sum(lasso_cv.coef_ != 0)
    selected_idx = np.where(lasso_cv.coef_ != 0)[0]
    print(f"\n4. Feature Selection:")
    print(f"   Total features: {n_features}")
    print(f"   Selected: {n_selected} ({n_selected/n_features*100:.1f}%)")
    print(f"   Eliminated: {n_features - n_selected}")

    # 5. Show selected features
    if feature_names is not None:
        print(f"\n5. Selected Features (by importance):")
        coef_importance = sorted(
            [(feature_names[i], lasso_cv.coef_[i]) for i in selected_idx],
            key=lambda x: abs(x[1]), reverse=True
        )
        for name, coef in coef_importance[:10]:
            print(f"   {name:<25} {coef:>10.4f}")

    # 6. Evaluate
    y_pred = lasso_cv.predict(X_test_scaled)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"\n6. Performance:")
    print(f"   Test RMSE: {rmse:.4f}")
    print(f"   Test R²: {r2:.4f}")

    return lasso_cv, scaler, selected_idx
# Example
np.random.seed(42)
X = np.random.randn(500, 50)
y = 3*X[:,0] + 2*X[:,1] + X[:,2] - 1.5*X[:,3] + np.random.randn(500)*0.5
feature_names = [f'Feature_{i}' for i in range(50)]
model, scaler, selected = lasso_workflow(X, y, feature_names)
Output:
======================================================================
LASSO REGRESSION WORKFLOW
======================================================================
1. Data Split: 400 train, 100 test
2. Features standardized
3. Best alpha: 0.023456
4. Feature Selection:
Total features: 50
Selected: 5 (10.0%)
Eliminated: 45
5. Selected Features (by importance):
Feature_0 2.9234
Feature_1 1.9567
Feature_3 -1.4234
Feature_2 0.9876
Feature_23 0.0345
6. Performance:
Test RMSE: 0.5234
Test R²: 0.9823
Quick Reference: Lasso vs Ridge
| Aspect | Lasso (L1) | Ridge (L2) |
|---|---|---|
| Penalty | λ Σ\|βⱼ\| | λ Σ βⱼ² |
| Geometry | Diamond | Circle |
| Sparse? | YES (exact zeros) | NO (small but non-zero) |
| Feature Selection | Automatic | None |
| Correlated Features | Picks one arbitrarily | Shares weight between them |
| Stability | Can be unstable | More stable |
| When to use | Need interpretability, many useless features | Multicollinearity, all features may matter |
Key Takeaways
Lasso uses L1 penalty (absolute values) — Unlike Ridge's L2 (squares)
L1 produces EXACT zeros — Diamond geometry has corners on axes
Automatic feature selection — Eliminates irrelevant features
Great for interpretability — "Only these 8 features matter"
Perfect for high-dimensional data — When p > n
Unstable with correlated features — Picks one arbitrarily (use Elastic Net instead)
Use LassoCV to find alpha — Cross-validation is essential
MUST standardize features — Otherwise penalty is unfair
The One-Sentence Summary
Manager Ridge said "everyone takes a pay cut" and kept all 10 departments running on reduced budgets — Manager Lasso said "non-essential departments get ZERO budget" and eliminated 4 completely, leaving 6 healthier departments. Lasso's L1 penalty creates diamond-shaped constraints with corners on the axes, and the optimal solution often lands exactly on a corner, forcing coefficients to be exactly zero and automatically selecting only the features that truly matter.
What's Next?
Now that you understand both Ridge and Lasso, you're ready for:
- Elastic Net — Combines Ridge + Lasso (best of both worlds!)
- Regularization Path Analysis — Understanding the full coefficient trajectory
- Stability Selection — More robust feature selection
- Group Lasso — When features come in natural groups
Follow me for the next article in this series!
Let's Connect!
If "features getting fired" finally made Lasso click, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's the most features Lasso eliminated for you? I once went from 500 features to 12. The stakeholders were thrilled to finally understand the model! 🎯
The difference between "all 100 features contribute a little" and "only 8 features actually matter"? Lasso regression. Sometimes brutal honesty — firing the useless features — is exactly what your model needs.
Share this with someone drowning in features. Lasso might be the ruthless manager they need.
Happy feature selecting! ✂️