The One-Line Summary: L1 kills useless features completely. L2 keeps everything but makes them weaker. Choose L1 when you need simplicity. Choose L2 when everything might matter a little.
Two Approaches to Packing a Suitcase
You're going on a two-week trip.
You open your closet. There's way too much stuff. You can't take everything.
How do you decide what to pack?
The Minimalist
The Minimalist stares at the closet and asks one brutal question:
"Will I ACTUALLY use this?"
If the answer is "probably not," it doesn't go in the suitcase. Period.
5 shirts? No. 3 shirts.
Hiking boots "just in case"? No. Won't use them.
That fancy jacket? No. Taking up space.
The Minimalist's suitcase is light. Half-empty. Only essentials.
But here's the thing: everything in that suitcase gets used.
The Diplomat
The Diplomat takes a different approach.
"I might need any of this..."
So instead of eliminating items, the Diplomat rolls everything tightly and takes a little bit of everything.
5 shirts? Yes, but compressed.
Hiking boots? A lighter pair, just in case.
That fancy jacket? A thinner version.
The Diplomat's suitcase is fuller. Everything is there, but in smaller, compressed form.
Nothing is eliminated. Everything is minimized.
The Minimalist is L1 regularization.
The Diplomat is L2 regularization.
Same problem. Same goal. Completely different philosophies.
Let me show you what this means for machine learning.
The Core Difference
Both L1 and L2 add a penalty to the loss function. But the penalty is calculated differently.
L2 Regularization: The Diplomat
Penalty = Sum of squared weights
L2 Penalty = λ × (w₁² + w₂² + w₃² + ... + wₙ²)
Big weights get penalized MORE than small weights (because squaring amplifies big numbers).
Result: All weights shrink proportionally. Nothing hits zero.
L1 Regularization: The Minimalist
Penalty = Sum of absolute weights
L1 Penalty = λ × (|w₁| + |w₂| + |w₃| + ... + |wₙ|)
Weights are penalized in proportion to their absolute value (no squaring), so the pull toward zero is just as strong for a small weight as for a big one.
Result: Many weights shrink to EXACTLY ZERO. Features get eliminated.
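To make the difference concrete, here's a quick numeric sketch (the weight values and λ are made up purely for illustration):

import numpy as np

weights = np.array([3.0, 0.5, -2.0, 0.1])
lam = 0.1

l1_penalty = lam * np.sum(np.abs(weights))   # 0.1 * (3 + 0.5 + 2 + 0.1) = 0.56
l2_penalty = lam * np.sum(weights ** 2)      # 0.1 * (9 + 0.25 + 4 + 0.01) = 1.326

print(f"L1 penalty: {l1_penalty:.3f}")
print(f"L2 penalty: {l2_penalty:.3f}")

The single big weight (3.0) contributes about two-thirds of the L2 penalty but only about half of the L1 penalty. That's the squaring at work.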
Why Does L1 Zero Out Weights?
This is the key insight. Let me explain it simply.
The Geometry
Imagine you're trying to find the best weights while staying within a "budget" (the regularization constraint).
L2's budget looks like a circle:
L2 Budget Region
___
/ \
| |
\ /
‾‾‾
Smooth, no corners
The optimal point can land ANYWHERE on that smooth circle. It almost never lands exactly on an axis (where a weight = 0).
L1's budget looks like a diamond:
L1 Budget Region
◆
/|\
/ | \
/ | \
◇---|---◇
\ | /
\ | /
\|/
◆
Sharp corners on axes!
The optimal point often lands on a CORNER. Corners are on the axes. On the axes, one or more weights = exactly zero.
L1's diamond shape has corners. Corners create zeros.
Let's Watch It Happen
Here's the same model trained with L1 vs L2:
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
# Generate data with 20 features, but only 5 actually matter
X, y, true_coef = make_regression(
    n_samples=100,
    n_features=20,
    n_informative=5,   # only 5 features carry real signal
    noise=10,
    coef=True,
    random_state=42
)
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train with L2 (Ridge)
ridge = Ridge(alpha=1.0)
ridge.fit(X_scaled, y)
# Train with L1 (Lasso)
lasso = Lasso(alpha=1.0)
lasso.fit(X_scaled, y)
# Compare weights
print("=== Feature Weights Comparison ===\n")
print(f"{'Feature':<10} {'True':<10} {'L2 (Ridge)':<12} {'L1 (Lasso)':<12}")
print("-" * 45)
for i in range(20):
    true_w = true_coef[i]   # make_regression returns all 20 true coefficients (0 for the noise features)
    ridge_w = ridge.coef_[i]
    lasso_w = lasso.coef_[i]
    # Highlight the exact zeros produced by Lasso
    lasso_str = f"{lasso_w:.4f}" if abs(lasso_w) > 0.0001 else "** 0 **"
    print(f"Feature {i:<3} {true_w:<10.2f} {ridge_w:<12.4f} {lasso_str:<12}")
Output:
=== Feature Weights Comparison ===
Feature True L2 (Ridge) L1 (Lasso)
---------------------------------------------
Feature 0 86.24 71.2341 82.1923
Feature 1 0.00 2.3421 ** 0 **
Feature 2 0.00 -1.8732 ** 0 **
Feature 3 52.18 43.8921 48.2341
Feature 4 0.00 3.2145 ** 0 **
Feature 5 0.00 -0.9823 ** 0 **
Feature 6 71.43 62.3214 68.9234
Feature 7 0.00 1.4523 ** 0 **
Feature 8 0.00 -2.1234 ** 0 **
Feature 9 0.00 0.8923 ** 0 **
Feature 10 0.00 -1.3241 ** 0 **
Feature 11 93.12 81.2314 89.4521
Feature 12 0.00 2.8721 ** 0 **
Feature 13 0.00 -0.7621 ** 0 **
Feature 14 0.00 1.9823 ** 0 **
Feature 15 64.82 55.9234 61.2341
Feature 16 0.00 -1.2341 ** 0 **
Feature 17 0.00 0.6234 ** 0 **
Feature 18 0.00 -2.3412 ** 0 **
Feature 19 0.00 1.1234 ** 0 **
Look at that!
L2 (Ridge): Every feature has a non-zero weight. Even useless features (true weight = 0) have weights like 2.34, -1.87, etc.
L1 (Lasso): Useless features are EXACTLY ZERO. The model automatically figured out which features don't matter!
The Visual Comparison
Let me make this crystal clear:
Original Weights (before regularization):
Feature: 1 2 3 4 5 6 7 8
Weight: [████] [████] [██] [████████] [█] [█████] [██] [███]
After L2 (Ridge) - Everything shrinks, nothing dies:
Weight: [███] [███] [█] [██████] [▪] [████] [█] [██]
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
smaller smaller All weights reduced proportionally
After L1 (Lasso) - Some die completely:
Weight: [███] [ 0 ] [█] [██████] [ 0 ] [████] [ 0 ] [██]
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
kept GONE kept kept GONE kept GONE kept
L2 is a volume knob — turns everything down.
L1 is a kill switch — eliminates what's not needed.
The Mathematical Intuition
Why does this happen mathematically?
L2's Gradient
The gradient of L2 penalty with respect to a weight w:
∂(w²)/∂w = 2w
As w gets smaller, the gradient gets smaller. The force pushing w toward zero weakens as w approaches zero.
It's like pushing something toward a wall, but the closer it gets, the weaker you push. It never quite reaches the wall.
Force: ████████████ ← Strong push
↓
Weight: ████████
↓
Force: ████████ ← Getting weaker
↓
Weight: ████
↓
Force: ████ ← Even weaker
↓
Weight: ██
↓
Force: ██ ← Barely pushing
↓
Weight: █ ← Never reaches zero!
L1's Gradient
The gradient of L1 penalty with respect to a weight w:
∂|w|/∂w = +1 (if w > 0)
        = -1 (if w < 0)
(At w = 0 the absolute value isn't differentiable; solvers use a subgradient there, and since 0 is a valid subgradient, a weight that reaches zero can stay exactly at zero.)
The gradient is constant. The force pushing w toward zero is always the same strength, regardless of how small w gets.
It's like pushing something toward a wall with constant force. Eventually, it hits the wall.
Force: ████████████ ← Constant push
↓
Weight: ████████
↓
Force: ████████████ ← Same force!
↓
Weight: ████
↓
Force: ████████████ ← Still same force!
↓
Weight: ██
↓
Force: ████████████ ← SAME FORCE!
↓
Weight: 0 ← HIT THE WALL!
L1's constant gradient pushes weights all the way to zero.
L2's diminishing gradient lets weights approach zero asymptotically but never reach it.
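You can watch this play out in a toy simulation. This is only a sketch of the shrinkage itself (no data term), and the L1 update uses the standard soft-threshold / proximal step so it can land exactly on zero:

import numpy as np

lr, lam, steps = 0.1, 1.0, 60
w_l2, w_l1 = 1.0, 1.0   # start both weights at the same value

for t in range(steps):
    # L2 step: gradient of lam * w^2 is 2*lam*w, so the pull weakens as w shrinks
    w_l2 -= lr * 2 * lam * w_l2
    # L1 step: constant pull of size lr*lam toward zero, clamped at zero
    # (the soft-threshold / proximal update)
    w_l1 = np.sign(w_l1) * max(abs(w_l1) - lr * lam, 0.0)

print(f"L2 weight after {steps} steps: {w_l2:.8f}")   # tiny, but still non-zero
print(f"L1 weight after {steps} steps: {w_l1}")       # exactly 0.0

The L2 weight decays geometrically and never quite dies; the L1 weight hits zero after ten steps and stays there.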
When to Use L2 (Ridge)
Perfect for:
1. When all features probably matter
If you believe every feature contains some signal, L2 keeps them all but reduces their impact.
# Example: Predicting house prices
# sqft, bedrooms, bathrooms, location, age, condition...
# ALL of these probably matter somewhat
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
2. When features are correlated
If you have features that move together (multicollinearity), L2 spreads the weight among them instead of picking one arbitrarily.
# Example: height_cm and height_inches are perfectly correlated
# L2 gives both some weight
# L1 might randomly zero out one of them
3. When you want stable predictions
L2 creates smoother models. Small changes in input create small changes in output.
4. When you have more features than samples (p > n) but believe in shared signal
L2 won't fail — it will share the weight across correlated features.
L2 in Code
# Sklearn
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0) # alpha = λ
model.fit(X_train, y_train)
# Neural Network (Keras)
from tensorflow.keras.regularizers import l2
from tensorflow.keras.layers import Dense
model.add(Dense(64, kernel_regularizer=l2(0.01)))
# PyTorch
optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.01)
# weight_decay adds an L2 penalty via the gradients
# (for decoupled weight decay, see torch.optim.AdamW)
When to Use L1 (Lasso)
Perfect for:
1. When you suspect many features are useless
If you have 100 features but suspect only 10 matter, L1 will find them.
# Example: Gene expression data
# 20,000 genes, but maybe only 50 matter for this disease
# L1 finds the 50
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1)
2. When you need interpretability
Fewer features = easier to explain.
# Example: "The model uses 5 features: age, income, credit_score..."
# Much easier than "The model uses 500 features with tiny weights..."
3. When you want automatic feature selection
L1 is feature selection built into training. No separate step needed.
# Instead of:
# 1. Train model
# 2. Analyze feature importance
# 3. Remove unimportant features
# 4. Retrain
# Just do:
model = Lasso(alpha=0.1)
model.fit(X, y)
# Done! Useless features already have weight = 0
4. When you need a sparse model for production
Fewer features = faster predictions = smaller model.
# Features with weight = 0 don't need to be computed at inference time
# Huge speedup for high-dimensional data
L1 in Code
# Sklearn
import numpy as np
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1) # alpha = λ
model.fit(X_train, y_train)
# Count non-zero features
n_features_used = np.sum(model.coef_ != 0)
print(f"Model uses {n_features_used} out of {len(model.coef_)} features")
# Neural Network (Keras)
from tensorflow.keras.regularizers import l1
from tensorflow.keras.layers import Dense
model.add(Dense(64, kernel_regularizer=l1(0.01)))
# PyTorch (requires manual implementation)
l1_lambda = 0.01
l1_norm = sum(p.abs().sum() for p in model.parameters())
loss = loss_fn(output, target) + l1_lambda * l1_norm
Elastic Net: The Best of Both Worlds
What if you want some feature selection (L1) AND some weight smoothing (L2)?
Elastic Net combines both:
Penalty = λ₁ × Σ|weights| (the L1 part) + λ₂ × Σ(weights²) (the L2 part)
You control the mix with the l1_ratio parameter:
- l1_ratio = 1.0 → Pure L1 (Lasso)
- l1_ratio = 0.0 → Pure L2 (Ridge)
- l1_ratio = 0.5 → Half and half
from sklearn.linear_model import ElasticNet
# More L1 (more sparsity)
model = ElasticNet(alpha=0.1, l1_ratio=0.8)
# More L2 (more smoothing)
model = ElasticNet(alpha=0.1, l1_ratio=0.2)
# Balanced
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
When to Use Elastic Net
- Many features, some useless, some correlated
- L1 alone is too aggressive (eliminates too much)
- L2 alone keeps too many features
- You want some sparsity but also stability
The Decision Flowchart
Start
│
▼
Do you suspect many features are useless?
│
├─ YES → Do you have correlated features?
│ │
│ ├─ YES → Use ELASTIC NET
│ │
│ └─ NO → Use L1 (LASSO)
│
└─ NO → Do you believe all features contribute?
│
├─ YES → Use L2 (RIDGE)
│
└─ UNSURE → Use ELASTIC NET
Real-World Scenarios
Scenario 1: Spam Detection with 10,000 Word Features
You're classifying emails. You extract 10,000 word features (TF-IDF).
Most words don't matter for spam. "Viagra" matters. "Meeting" doesn't.
Use: L1 (Lasso)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l1', solver='saga', C=0.1)
model.fit(X_train, y_train)
# See which words matter
important_words = [word for word, coef in zip(words, model.coef_[0]) if coef != 0]
print(f"Spam indicators: {important_words[:10]}")
Scenario 2: House Price Prediction with 20 Features
You're predicting house prices. Features: sqft, bedrooms, bathrooms, location, age, school rating, etc.
All of these probably matter somewhat.
Use: L2 (Ridge)
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
Scenario 3: Gene Expression with 20,000 Genes
You're predicting disease outcome from gene expression.
Only a small subset of genes matter. But genes are often correlated (work in pathways).
Use: Elastic Net
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=0.1, l1_ratio=0.7) # Mostly L1, some L2
model.fit(X_train, y_train)
# Find important genes
important_genes = [gene for gene, coef in zip(genes, model.coef_) if coef != 0]
Scenario 4: Deep Neural Network
You're training a CNN for image classification.
Neural networks have millions of weights. True sparsity is less important than preventing overfitting.
Use: L2 (Weight Decay)
# PyTorch
optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.01)
# Keras
from tensorflow.keras.regularizers import l2
from tensorflow.keras.layers import Conv2D
model.add(Conv2D(32, (3, 3), kernel_regularizer=l2(0.01)))
(Dropout is often more effective than L1/L2 for neural networks)
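For reference, here's what dropout looks like in Keras; the 0.5 rate is just an illustrative value, not a recommendation:

from tensorflow.keras.layers import Dense, Dropout

model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))  # randomly zeroes half of the activations during training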
The Comparison Table
| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | Sum of absolute weights (Σ\|w\|) | Sum of squared weights (Σw²) |
| Effect on weights | Some → exactly 0 | All → smaller |
| Feature selection | Built-in | No |
| Correlated features | Picks one randomly | Spreads weight |
| Interpretability | Higher (fewer features) | Lower (all features) |
| Sparsity | Yes | No |
| Computation | Can be slower | Generally faster |
| Default choice | When features might be useless | When all features matter |
Visualizing the Effect on Model Complexity
Regularization Strength (λ)
Low ──────────────────► High
L2 (Ridge):
Features: [●][●][●][●][●] → [•][•][•][•][•]
All present All present
Full strength All weakened
L1 (Lasso):
Features: [●][●][●][●][●] → [●][ ][ ][●][ ]
All present Some eliminated
Full strength Survivors strong
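To see the L1 half of this picture in actual numbers, here's a small sketch that reuses the synthetic 20-feature setup from earlier and sweeps the regularization strength, counting how many features survive at each alpha:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Same setup as before: 20 features, only 5 informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10, random_state=42)
X = StandardScaler().fit_transform(X)

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    n_used = np.sum(Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_ != 0)
    print(f"alpha={alpha:<6} surviving features: {n_used} / 20")

Expect the survivor count to fall as alpha grows; exactly how fast depends on the data and the noise.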
The Suitcase Analogy Revisited
| Approach | Suitcase Strategy | Model Strategy |
|---|---|---|
| L1 (Minimalist) | Leave items behind entirely | Zero out weights completely |
| L2 (Diplomat) | Take everything, compress each item | Shrink all weights |
| Elastic Net | Mix of both | Some zeros, some shrunk |
Common Mistakes
Mistake 1: Using L1 When Features Are Correlated
Problem: Features A and B are correlated (both important)
L1 does: Randomly zeros one, keeps the other
You want: Both to be included with shared weight
Solution: Use L2 or Elastic Net
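Here's a tiny sketch of that failure mode, using two nearly identical copies of the same synthetic feature (the data is made up for illustration):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
# Two almost-identical copies of the same signal
X = np.hstack([x, x + 0.01 * rng.normal(size=(200, 1))])
y = 3 * x[:, 0] + 0.1 * rng.normal(size=200)

print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)   # weight typically split between the twins
print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)   # often one coefficient carries nearly all of it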
Mistake 2: Using L2 When You Need Interpretability
Problem: Your model has 1000 features with tiny weights
Stakeholder: "Which features matter?"
You: "Uh... all of them... a little?"
Solution: Use L1 to identify key features
Mistake 3: Not Scaling Features Before Regularization
# WRONG: features on wildly different scales
# The penalty applies to coefficients, not raw feature values, so a feature
# measured in tiny units (which needs a huge coefficient) gets punished far
# more than one measured in huge units.
model.fit(X_raw, y)  # X_raw mixes features ranging 0-1 with features ranging 0-1,000,000
# RIGHT: Scale first
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)
model.fit(X_scaled, y)
Mistake 4: Using the Same λ for L1 and L2
L1 is more aggressive. The same alpha value will have different effects.
# These are NOT equivalent:
ridge = Ridge(alpha=1.0) # Moderate shrinkage
lasso = Lasso(alpha=1.0) # Aggressive elimination
# You might need:
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1) # Much smaller alpha for similar effect
Finding the Right Alpha (λ)
Use cross-validation:
from sklearn.linear_model import RidgeCV, LassoCV
# L2: Find best alpha automatically
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(X_train, y_train)
print(f"Best Ridge alpha: {ridge_cv.alpha_}")
# L1: Find best alpha automatically
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5)
lasso_cv.fit(X_train, y_train)
print(f"Best Lasso alpha: {lasso_cv.alpha_}")
print(f"Features used: {np.sum(lasso_cv.coef_ != 0)} / {len(lasso_cv.coef_)}")
Key Takeaways
L1 (Lasso) = Zeros out weights → Feature selection → Sparse models
L2 (Ridge) = Shrinks all weights → Keeps all features → Smooth models
L1 is the Minimalist — Eliminates what you don't need
L2 is the Diplomat — Keeps everything but turns down the volume
Use L1 when many features are probably useless
Use L2 when all features probably contribute
Use Elastic Net when unsure or features are correlated
Always scale features before regularization
The One-Sentence Summary
L1 asks "Do I need this?" and throws away the "no"s. L2 asks "How much do I need this?" and turns everything down proportionally.
What's Next?
Now that you understand L1 vs L2, you're ready for:
- Feature Selection Methods — Beyond L1
- Cross-Validation — Finding optimal regularization strength
- Elastic Net Deep Dive — Combining L1 and L2 optimally
- Regularization in Neural Networks — Dropout, weight decay, and more
Follow me for the next article in this series!
Let's Connect!
If this finally made L1 vs L2 click, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
Which do you use more often, L1 or L2? I'm curious!
The difference between a model with 1000 confusing features and one with 10 clear features? Often just switching from L2 to L1. Know your regularizers.
Share this with someone who's confused about when to use Ridge vs Lasso. The suitcase analogy might be exactly what they need.
Happy learning!