The One-Line Summary: L1 kills useless features completely. L2 keeps everything but makes them weaker. Choose L1 when you need simplicity. Choose L2 when everything might matter a little.
Two Approaches to Packing a Suitcase
You're going on a two-week trip.
You open your closet. There's way too much stuff. You can't take everything.
How do you decide what to pack?
The Minimalist
The Minimalist stares at the closet and asks one brutal question:
"Will I ACTUALLY use this?"
If the answer is "probably not," it doesn't go in the suitcase. Period.
5 shirts? No. 3 shirts.
Hiking boots "just in case"? No. Won't use them.
That fancy jacket? No. Taking up space.
The Minimalist's suitcase is light. Half-empty. Only essentials.
But here's the thing: everything in that suitcase gets used.
The Diplomat
The Diplomat takes a different approach.
"I might need any of this..."
So instead of eliminating items, the Diplomat rolls everything tightly and takes a little bit of everything.
5 shirts? Yes, but compressed.
Hiking boots? A lighter pair, just in case.
That fancy jacket? A thinner version.
The Diplomat's suitcase is fuller. Everything is there, but in smaller, compressed form.
Nothing is eliminated. Everything is minimized.
The Minimalist is L1 regularization.
The Diplomat is L2 regularization.
Same problem. Same goal. Completely different philosophies.
Let me show you what this means for machine learning.
The Core Difference
Both L1 and L2 add a penalty to the loss function. But the penalty is calculated differently.
L2 Regularization: The Diplomat
Penalty = Sum of squared weights
L2 Penalty = λ × (w₁² + w₂² + w₃² + ... + wₙ²)
Big weights get penalized MORE than small weights (because squaring amplifies big numbers).
Result: All weights shrink proportionally. Nothing hits zero.
L1 Regularization: The Minimalist
Penalty = Sum of absolute weights
L1 Penalty = λ × (|w₁| + |w₂| + |w₃| + ... + |wₙ|)
Weights are penalized in proportion to their absolute value (no squaring), so the pull toward zero is just as strong for a small weight as for a big one.
Result: Many weights shrink to EXACTLY ZERO. Features get eliminated.
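To make the difference concrete, here's a quick numeric sketch (the weight values and λ are made up purely for illustration):

import numpy as np

weights = np.array([3.0, 0.5, -2.0, 0.1])
lam = 0.1

l1_penalty = lam * np.sum(np.abs(weights))   # 0.1 * (3 + 0.5 + 2 + 0.1) = 0.56
l2_penalty = lam * np.sum(weights ** 2)      # 0.1 * (9 + 0.25 + 4 + 0.01) = 1.326

print(f"L1 penalty: {l1_penalty:.3f}")
print(f"L2 penalty: {l2_penalty:.3f}")

The single big weight (3.0) contributes about two-thirds of the L2 penalty but only about half of the L1 penalty. That's the squaring at work.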
Why Does L1 Zero Out Weights?
This is the key insight. Let me explain it simply.
The Geometry
Imagine you're trying to find the best weights while staying within a "budget" (the regularization constraint).
L2's budget looks like a circle:
L2 Budget Region
___
/ \
| |
\ /
‾‾‾
Smooth, no corners
The optimal point can land ANYWHERE on that smooth circle. It almost never lands exactly on an axis (where a weight = 0).
L1's budget looks like a diamond:
L1 Budget Region
◆
/|\
/ | \
/ | \
◇---|---◇
\ | /
\ | /
\|/
◆
Sharp corners on axes!
The optimal point often lands on a CORNER. Corners are on the axes. On the axes, one or more weights = exactly zero.
L1's diamond shape has corners. Corners create zeros.
Let's Watch It Happen
Here's the same model trained with L1 vs L2:
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
# Generate data with 20 features, but only 5 actually matter
X, y, true_coef = make_regression(
    n_samples=100,
    n_features=20,
    n_informative=5,   # only 5 features carry real signal
    noise=10,
    coef=True,
    random_state=42
)
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train with L2 (Ridge)
ridge = Ridge(alpha=1.0)
ridge.fit(X_scaled, y)
# Train with L1 (Lasso)
lasso = Lasso(alpha=1.0)
lasso.fit(X_scaled, y)
# Compare weights
print("=== Feature Weights Comparison ===\n")
print(f"{'Feature':<10} {'True':<10} {'L2 (Ridge)':<12} {'L1 (Lasso)':<12}")
print("-" * 45)
for i in range(20):
    true_w = true_coef[i]   # make_regression returns all 20 true coefficients (0 for the noise features)
    ridge_w = ridge.coef_[i]
    lasso_w = lasso.coef_[i]
    # Highlight the exact zeros produced by Lasso
    lasso_str = f"{lasso_w:.4f}" if abs(lasso_w) > 0.0001 else "** 0 **"
    print(f"Feature {i:<3} {true_w:<10.2f} {ridge_w:<12.4f} {lasso_str:<12}")
Output:
=== Feature Weights Comparison ===
Feature True L2 (Ridge) L1 (Lasso)
---------------------------------------------
Feature 0 86.24 71.2341 82.1923
Feature 1 0.00 2.3421 ** 0 **
Feature 2 0.00 -1.8732 ** 0 **
Feature 3 52.18 43.8921 48.2341
Feature 4 0.00 3.2145 ** 0 **
Feature 5 0.00 -0.9823 ** 0 **
Feature 6 71.43 62.3214 68.9234
Feature 7 0.00 1.4523 ** 0 **
Feature 8 0.00 -2.1234 ** 0 **
Feature 9 0.00 0.8923 ** 0 **
Feature 10 0.00 -1.3241 ** 0 **
Feature 11 93.12 81.2314 89.4521
Feature 12 0.00 2.8721 ** 0 **
Feature 13 0.00 -0.7621 ** 0 **
Feature 14 0.00 1.9823 ** 0 **
Feature 15 64.82 55.9234 61.2341
Feature 16 0.00 -1.2341 ** 0 **
Feature 17 0.00 0.6234 ** 0 **
Feature 18 0.00 -2.3412 ** 0 **
Feature 19 0.00 1.1234 ** 0 **
Look at that!
L2 (Ridge): Every feature has a non-zero weight. Even useless features (true weight = 0) have weights like 2.34, -1.87, etc.
L1 (Lasso): Useless features are EXACTLY ZERO. The model automatically figured out which features don't matter!
The Visual Comparison
Let me make this crystal clear:
Original Weights (before regularization):
Feature: 1 2 3 4 5 6 7 8
Weight: [████] [████] [██] [████████] [█] [█████] [██] [███]
After L2 (Ridge) - Everything shrinks, nothing dies:
Weight: [███] [███] [█] [██████] [▪] [████] [█] [██]
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
smaller smaller All weights reduced proportionally
After L1 (Lasso) - Some die completely:
Weight: [███] [ 0 ] [█] [██████] [ 0 ] [████] [ 0 ] [██]
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
kept GONE kept kept GONE kept GONE kept
L2 is a volume knob — turns everything down.
L1 is a kill switch — eliminates what's not needed.
The Mathematical Intuition
Why does this happen mathematically?
L2's Gradient
The gradient of L2 penalty with respect to a weight w:
∂(w²)/∂w = 2w
As w gets smaller, the gradient gets smaller. The force pushing w toward zero weakens as w approaches zero.
It's like pushing something toward a wall, but the closer it gets, the weaker you push. It never quite reaches the wall.
Force: ████████████ ← Strong push
↓
Weight: ████████
↓
Force: ████████ ← Getting weaker
↓
Weight: ████
↓
Force: ████ ← Even weaker
↓
Weight: ██
↓
Force: ██ ← Barely pushing
↓
Weight: █ ← Never reaches zero!
L1's Gradient
The gradient of L1 penalty with respect to a weight w:
∂|w|/∂w = +1 (if w > 0)
        = -1 (if w < 0)
(At w = 0 the absolute value isn't differentiable; solvers use a subgradient there, and since 0 is a valid subgradient, a weight that reaches zero can stay exactly at zero.)
The gradient is constant. The force pushing w toward zero is always the same strength, regardless of how small w gets.
It's like pushing something toward a wall with constant force. Eventually, it hits the wall.
Force: ████████████ ← Constant push
↓
Weight: ████████
↓
Force: ████████████ ← Same force!
↓
Weight: ████
↓
Force: ████████████ ← Still same force!
↓
Weight: ██
↓
Force: ████████████ ← SAME FORCE!
↓
Weight: 0 ← HIT THE WALL!
L1's constant gradient pushes weights all the way to zero.
L2's diminishing gradient lets weights approach zero asymptotically but never reach it.
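You can watch this play out in a toy simulation. This is only a sketch of the shrinkage itself (no data term), and the L1 update uses the standard soft-threshold / proximal step so it can land exactly on zero:

import numpy as np

lr, lam, steps = 0.1, 1.0, 60
w_l2, w_l1 = 1.0, 1.0   # start both weights at the same value

for t in range(steps):
    # L2 step: gradient of lam * w^2 is 2*lam*w, so the pull weakens as w shrinks
    w_l2 -= lr * 2 * lam * w_l2
    # L1 step: constant pull of size lr*lam toward zero, clamped at zero
    # (the soft-threshold / proximal update)
    w_l1 = np.sign(w_l1) * max(abs(w_l1) - lr * lam, 0.0)

print(f"L2 weight after {steps} steps: {w_l2:.8f}")   # tiny, but still non-zero
print(f"L1 weight after {steps} steps: {w_l1}")       # exactly 0.0

The L2 weight decays geometrically and never quite dies; the L1 weight hits zero after ten steps and stays there.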
When to Use L2 (Ridge)
Perfect for:
1. When all features probably matter
If you believe every feature contains some signal, L2 keeps them all but reduces their impact.
# Example: Predicting house prices
# sqft, bedrooms, bathrooms, location, age, condition...
# ALL of these probably matter somewhat
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
2. When features are correlated
If you have features that move together (multicollinearity), L2 spreads the weight among them instead of picking one arbitrarily.
# Example: height_cm and height_inches are perfectly correlated
# L2 gives both some weight
# L1 might randomly zero out one of them
3. When you want stable predictions
L2 creates smoother models. Small changes in input create small changes in output.
4. When you have more features than samples (p > n) but believe in shared signal
L2 won't fail — it will share the weight across correlated features.
L2 in Code
# Sklearn
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0) # alpha = λ
model.fit(X_train, y_train)
# Neural Network (Keras)
from tensorflow.keras.regularizers import l2
from tensorflow.keras.layers import Dense
model.add(Dense(64, kernel_regularizer=l2(0.01)))
# PyTorch
optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.01)
# weight_decay adds an L2 penalty via the gradients
# (for decoupled weight decay, see torch.optim.AdamW)
When to Use L1 (Lasso)
Perfect for:
1. When you suspect many features are useless
If you have 100 features but suspect only 10 matter, L1 will find them.
# Example: Gene expression data
# 20,000 genes, but maybe only 50 matter for this disease
# L1 finds the 50
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1)
2. When you need interpretability
Fewer features = easier to explain.
# Example: "The model uses 5 features: age, income, credit_score..."
# Much easier than "The model uses 500 features with tiny weights..."
3. When you want automatic feature selection
L1 is feature selection built into training. No separate step needed.
# Instead of:
# 1. Train model
# 2. Analyze feature importance
# 3. Remove unimportant features
# 4. Retrain
# Just do:
model = Lasso(alpha=0.1)
model.fit(X, y)
# Done! Useless features already have weight = 0
4. When you need a sparse model for production
Fewer features = faster predictions = smaller model.
# Features with weight = 0 don't need to be computed at inference time
# Huge speedup for high-dimensional data
L1 in Code
# Sklearn
import numpy as np
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1) # alpha = λ
model.fit(X_train, y_train)
# Count non-zero features
n_features_used = np.sum(model.coef_ != 0)
print(f"Model uses {n_features_used} out of {len(model.coef_)} features")
# Neural Network (Keras)
from tensorflow.keras.regularizers import l1
from tensorflow.keras.layers import Dense
model.add(Dense(64, kernel_regularizer=l1(0.01)))
# PyTorch (requires manual implementation)
l1_lambda = 0.01
l1_norm = sum(p.abs().sum() for p in model.parameters())
loss = loss_fn(output, target) + l1_lambda * l1_norm
Elastic Net: The Best of Both Worlds
What if you want some feature selection (L1) AND some weight smoothing (L2)?
Elastic Net combines both:
Penalty = λ₁ × Σ|weights| (the L1 part) + λ₂ × Σ(weights²) (the L2 part)
You control the mix with the l1_ratio parameter:
- l1_ratio = 1.0 → Pure L1 (Lasso)
- l1_ratio = 0.0 → Pure L2 (Ridge)
- l1_ratio = 0.5 → Half and half
from sklearn.linear_model import ElasticNet
# More L1 (more sparsity)
model = ElasticNet(alpha=0.1, l1_ratio=0.8)
# More L2 (more smoothing)
model = ElasticNet(alpha=0.1, l1_ratio=0.2)
# Balanced
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
When to Use Elastic Net
- Many features, some useless, some correlated
- L1 alone is too aggressive (eliminates too much)
- L2 alone keeps too many features
- You want some sparsity but also stability
The Decision Flowchart
Start
│
▼
Do you suspect many features are useless?
│
├─ YES → Do you have correlated features?
│ │
│ ├─ YES → Use ELASTIC NET
│ │
│ └─ NO → Use L1 (LASSO)
│
└─ NO → Do you believe all features contribute?
│
├─ YES → Use L2 (RIDGE)
│
└─ UNSURE → Use ELASTIC NET
Real-World Scenarios
Scenario 1: Spam Detection with 10,000 Word Features
You're classifying emails. You extract 10,000 word features (TF-IDF).
Most words don't matter for spam. "Viagra" matters. "Meeting" doesn't.
Use: L1 (Lasso)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l1', solver='saga', C=0.1)
model.fit(X_train, y_train)
# See which words matter
important_words = [word for word, coef in zip(words, model.coef_[0]) if coef != 0]
print(f"Spam indicators: {important_words[:10]}")
Scenario 2: House Price Prediction with 20 Features
You're predicting house prices. Features: sqft, bedrooms, bathrooms, location, age, school rating, etc.
All of these probably matter somewhat.
Use: L2 (Ridge)
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
Scenario 3: Gene Expression with 20,000 Genes
You're predicting disease outcome from gene expression.
Only a small subset of genes matter. But genes are often correlated (work in pathways).
Use: Elastic Net
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=0.1, l1_ratio=0.7) # Mostly L1, some L2
model.fit(X_train, y_train)
# Find important genes
important_genes = [gene for gene, coef in zip(genes, model.coef_) if coef != 0]
Scenario 4: Deep Neural Network
You're training a CNN for image classification.
Neural networks have millions of weights. True sparsity is less important than preventing overfitting.
Use: L2 (Weight Decay)
# PyTorch
optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.01)
# Keras
from tensorflow.keras.regularizers import l2
from tensorflow.keras.layers import Conv2D
model.add(Conv2D(32, (3, 3), kernel_regularizer=l2(0.01)))
(Dropout is often more effective than L1/L2 for neural networks)
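For reference, here's what dropout looks like in Keras; the 0.5 rate is just an illustrative value, not a recommendation:

from tensorflow.keras.layers import Dense, Dropout

model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))  # randomly zeroes half of the activations during training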
The Comparison Table
| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | Sum of absolute weights (Σ\|w\|) | Sum of squared weights (Σw²) |
| Effect on weights | Some → exactly 0 | All → smaller |
| Feature selection | Built-in | No |
| Correlated features | Picks one randomly | Spreads weight |
| Interpretability | Higher (fewer features) | Lower (all features) |
| Sparsity | Yes | No |
| Computation | Can be slower | Generally faster |
| Default choice | When features might be useless | When all features matter |
Visualizing the Effect on Model Complexity
Regularization Strength (λ)
Low ──────────────────► High
L2 (Ridge):
Features: [●][●][●][●][●] → [•][•][•][•][•]
All present All present
Full strength All weakened
L1 (Lasso):
Features: [●][●][●][●][●] → [●][ ][ ][●][ ]
All present Some eliminated
Full strength Survivors strong
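To see the L1 half of this picture in actual numbers, here's a small sketch that reuses the synthetic 20-feature setup from earlier and sweeps the regularization strength, counting how many features survive at each alpha:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Same setup as before: 20 features, only 5 informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10, random_state=42)
X = StandardScaler().fit_transform(X)

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    n_used = np.sum(Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_ != 0)
    print(f"alpha={alpha:<6} surviving features: {n_used} / 20")

Expect the survivor count to fall as alpha grows; exactly how fast depends on the data and the noise.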
The Suitcase Analogy Revisited
| Approach | Suitcase Strategy | Model Strategy |
|---|---|---|
| L1 (Minimalist) | Leave items behind entirely | Zero out weights completely |
| L2 (Diplomat) | Take everything, compress each item | Shrink all weights |
| Elastic Net | Mix of both | Some zeros, some shrunk |
Common Mistakes
Mistake 1: Using L1 When Features Are Correlated
Problem: Features A and B are correlated (both important)
L1 does: Randomly zeros one, keeps the other
You want: Both to be included with shared weight
Solution: Use L2 or Elastic Net
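Here's a tiny sketch of that failure mode, using two nearly identical copies of the same synthetic feature (the data is made up for illustration):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
# Two almost-identical copies of the same signal
X = np.hstack([x, x + 0.01 * rng.normal(size=(200, 1))])
y = 3 * x[:, 0] + 0.1 * rng.normal(size=200)

print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)   # weight typically split between the twins
print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)   # often one coefficient carries nearly all of it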
Mistake 2: Using L2 When You Need Interpretability
Problem: Your model has 1000 features with tiny weights
Stakeholder: "Which features matter?"
You: "Uh... all of them... a little?"
Solution: Use L1 to identify key features
Mistake 3: Not Scaling Features Before Regularization
# WRONG: features on wildly different scales
# The penalty applies to coefficients, not raw feature values, so a feature
# measured in tiny units (which needs a huge coefficient) gets punished far
# more than one measured in huge units.
model.fit(X_raw, y)  # X_raw mixes features ranging 0-1 with features ranging 0-1,000,000
# RIGHT: Scale first
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)
model.fit(X_scaled, y)
Mistake 4: Using the Same λ for L1 and L2
L1 is more aggressive. The same alpha value will have different effects.
# These are NOT equivalent:
ridge = Ridge(alpha=1.0) # Moderate shrinkage
lasso = Lasso(alpha=1.0) # Aggressive elimination
# You might need:
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1) # Much smaller alpha for similar effect
Finding the Right Alpha (λ)
Use cross-validation:
from sklearn.linear_model import RidgeCV, LassoCV
# L2: Find best alpha automatically
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(X_train, y_train)
print(f"Best Ridge alpha: {ridge_cv.alpha_}")
# L1: Find best alpha automatically
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5)
lasso_cv.fit(X_train, y_train)
print(f"Best Lasso alpha: {lasso_cv.alpha_}")
print(f"Features used: {np.sum(lasso_cv.coef_ != 0)} / {len(lasso_cv.coef_)}")
Key Takeaways
L1 (Lasso) = Zeros out weights → Feature selection → Sparse models
L2 (Ridge) = Shrinks all weights → Keeps all features → Smooth models
L1 is the Minimalist — Eliminates what you don't need
L2 is the Diplomat — Keeps everything but turns down the volume
Use L1 when many features are probably useless
Use L2 when all features probably contribute
Use Elastic Net when unsure or features are correlated
Always scale features before regularization
The One-Sentence Summary
L1 asks "Do I need this?" and throws away the "no"s. L2 asks "How much do I need this?" and turns everything down proportionally.
What's Next?
Now that you understand L1 vs L2, you're ready for:
- Feature Selection Methods — Beyond L1
- Cross-Validation — Finding optimal regularization strength
- Elastic Net Deep Dive — Combining L1 and L2 optimally
- Regularization in Neural Networks — Dropout, weight decay, and more
Follow me for the next article in this series!
Let's Connect!
If this finally made L1 vs L2 click, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
Which do you use more often, L1 or L2? I'm curious!
The difference between a model with 1000 confusing features and one with 10 clear features? Often just switching from L2 to L1. Know your regularizers.
Share this with someone who's confused about when to use Ridge vs Lasso. The suitcase analogy might be exactly what they need.
Happy learning!