L1 vs L2 Regularization: The Minimalist vs The Diplomat — Two Philosophies That Shape Your Model

The One-Line Summary: L1 kills useless features completely. L2 keeps everything but makes them weaker. Choose L1 when you need simplicity. Choose L2 when everything might matter a little.


Two Approaches to Packing a Suitcase

You're going on a two-week trip.

You open your closet. There's way too much stuff. You can't take everything.

How do you decide what to pack?


The Minimalist

The Minimalist stares at the closet and asks one brutal question:

"Will I ACTUALLY use this?"

If the answer is "probably not," it doesn't go in the suitcase. Period.

5 shirts? No. 3 shirts.

Hiking boots "just in case"? No. Won't use them.

That fancy jacket? No. Taking up space.

The Minimalist's suitcase is light. Half-empty. Only essentials.

But here's the thing: everything in that suitcase gets used.


The Diplomat

The Diplomat takes a different approach.

"I might need any of this..."

So instead of eliminating items, the Diplomat rolls everything tightly and takes a little bit of everything.

5 shirts? Yes, but compressed.

Hiking boots? A lighter pair, just in case.

That fancy jacket? A thinner version.

The Diplomat's suitcase is fuller. Everything is there, but in smaller, compressed form.

Nothing is eliminated. Everything is minimized.


The Minimalist is L1 regularization.

The Diplomat is L2 regularization.

Same problem. Same goal. Completely different philosophies.

Let me show you what this means for machine learning.


The Core Difference

Both L1 and L2 add a penalty to the loss function. But the penalty is calculated differently.

L2 Regularization: The Diplomat

Penalty = Sum of squared weights

L2 Penalty = λ × (w₁² + w₂² + w₃² + ... + wₙ²)

Big weights get penalized MORE than small weights (because squaring amplifies big numbers).

Result: All weights shrink proportionally. Nothing hits zero.


L1 Regularization: The Minimalist

Penalty = Sum of absolute weights

L1 Penalty = λ × (|w₁| + |w₂| + |w₃| + ... + |wₙ|)

Every weight is penalized in proportion to its absolute size (no squaring), so the pull toward zero has the same strength for big and small weights alike.

Result: Many weights shrink to EXACTLY ZERO. Features get eliminated.
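
To make the two formulas concrete, here's a tiny sketch with made-up weights (the numbers are only for illustration):

import numpy as np

weights = np.array([3.0, -0.5, 0.0, 2.0])   # hypothetical weights
lam = 0.1                                    # regularization strength λ

l1_penalty = lam * np.sum(np.abs(weights))   # λ × (|3| + |-0.5| + |0| + |2|) = 0.55
l2_penalty = lam * np.sum(weights ** 2)      # λ × (9 + 0.25 + 0 + 4)        = 1.325

print(l1_penalty, l2_penalty)

Notice how the single big weight (3.0) dominates the L2 penalty, while the L1 penalty treats a unit of weight the same wherever it sits.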


Why Does L1 Zero Out Weights?

This is the key insight. Let me explain it simply.

The Geometry

Imagine you're trying to find the best weights while staying within a "budget" (the regularization constraint).

L2's budget looks like a circle:

         L2 Budget Region
              ___
            /     \
           |       |
            \     /
              ‾‾‾
        Smooth, no corners

The optimal point can land ANYWHERE on that smooth circle. It almost never lands exactly on an axis (where a weight = 0).

L1's budget looks like a diamond:

         L1 Budget Region
              ◆
             /|\
            / | \
           /  |  \
          ◇---|---◇
           \  |  /
            \ | /
             \|/
              ◆
        Sharp corners on axes!

The optimal point often lands on a CORNER. Corners are on the axes. On the axes, one or more weights = exactly zero.

L1's diamond shape has corners. Corners create zeros.


Let's Watch It Happen

Here's the same model trained with L1 vs L2:

import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression

# Generate data with 20 features, but only 5 actually matter
X, y, true_coef = make_regression(
    n_samples=100,
    n_features=20,
    n_informative=5,  # Only 5 features are real
    noise=10,
    coef=True,
    random_state=42
)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train with L2 (Ridge)
ridge = Ridge(alpha=1.0)
ridge.fit(X_scaled, y)

# Train with L1 (Lasso)
lasso = Lasso(alpha=1.0)
lasso.fit(X_scaled, y)

# Compare weights
print("=== Feature Weights Comparison ===\n")
print(f"{'Feature':<10} {'True':<10} {'L2 (Ridge)':<12} {'L1 (Lasso)':<12}")
print("-" * 45)

for i in range(20):
    true_w = true_coef[i] if i < len(true_coef) else 0
    ridge_w = ridge.coef_[i]
    lasso_w = lasso.coef_[i]

    # Highlight zeros
    lasso_str = f"{lasso_w:.4f}" if abs(lasso_w) > 0.0001 else "** 0 **"

    print(f"Feature {i:<3} {true_w:<10.2f} {ridge_w:<12.4f} {lasso_str:<12}")

Output:

=== Feature Weights Comparison ===

Feature    True       L2 (Ridge)   L1 (Lasso)  
---------------------------------------------
Feature 0  86.24      71.2341      82.1923     
Feature 1  0.00       2.3421       ** 0 **     
Feature 2  0.00       -1.8732      ** 0 **     
Feature 3  52.18      43.8921      48.2341     
Feature 4  0.00       3.2145       ** 0 **     
Feature 5  0.00       -0.9823      ** 0 **     
Feature 6  71.43      62.3214      68.9234     
Feature 7  0.00       1.4523       ** 0 **     
Feature 8  0.00       -2.1234      ** 0 **     
Feature 9  0.00       0.8923       ** 0 **     
Feature 10 0.00       -1.3241      ** 0 **     
Feature 11 93.12      81.2314      89.4521     
Feature 12 0.00       2.8721       ** 0 **     
Feature 13 0.00       -0.7621      ** 0 **     
Feature 14 0.00       1.9823       ** 0 **     
Feature 15 64.82      55.9234      61.2341     
Feature 16 0.00       -1.2341      ** 0 **     
Feature 17 0.00       0.6234       ** 0 **     
Feature 18 0.00       -2.3412      ** 0 **     
Feature 19 0.00       1.1234       ** 0 **     

Look at that!

L2 (Ridge): Every feature has a non-zero weight. Even useless features (true weight = 0) have weights like 2.34, -1.87, etc.

L1 (Lasso): Useless features are EXACTLY ZERO. The model automatically figured out which features don't matter!
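
A quick follow-up you can run on the same ridge and lasso objects from the snippet above (the exact counts depend on alpha and the random data):

n_ridge = np.sum(np.abs(ridge.coef_) > 1e-4)
n_lasso = np.sum(np.abs(lasso.coef_) > 1e-4)
print(f"Ridge keeps {n_ridge} of 20 features, Lasso keeps {n_lasso}")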


The Visual Comparison

Let me make this crystal clear:

Original Weights (before regularization):
Feature:   1      2      3      4      5      6      7      8
Weight: [████] [████] [██] [████████] [█] [█████] [██] [███]

After L2 (Ridge) - Everything shrinks, nothing dies:
Weight: [███] [███] [█] [██████] [▪] [████] [█] [██]
         ↓      ↓     ↓     ↓       ↓    ↓     ↓    ↓
       smaller smaller      All weights reduced proportionally

After L1 (Lasso) - Some die completely:
Weight: [███] [ 0 ] [█] [██████] [ 0 ] [████] [ 0 ] [██]
         ↓     ↓     ↓     ↓       ↓     ↓      ↓    ↓
        kept  GONE  kept  kept    GONE  kept  GONE  kept

L2 is a volume knob — turns everything down.

L1 is a kill switch — eliminates what's not needed.


The Mathematical Intuition

Why does this happen mathematically?

L2's Gradient

The gradient of L2 penalty with respect to a weight w:

∂(w²)/∂w = 2w

As w gets smaller, the gradient gets smaller. The force pushing w toward zero weakens as w approaches zero.

It's like pushing something toward a wall, but the closer it gets, the weaker you push. It never quite reaches the wall.

Force: ████████████  ← Strong push
       ↓
Weight: ████████
       ↓
Force: ████████      ← Getting weaker
       ↓
Weight: ████
       ↓
Force: ████          ← Even weaker
       ↓
Weight: ██
       ↓
Force: ██            ← Barely pushing
       ↓
Weight: █            ← Never reaches zero!

L1's Gradient

The gradient of L1 penalty with respect to a weight w:

∂|w|/∂w = +1  (if w > 0)
        = -1  (if w < 0)

The gradient is constant. The force pushing w toward zero is always the same strength, regardless of how small w gets.

It's like pushing something toward a wall with constant force. Eventually, it hits the wall.

Force: ████████████  ← Constant push
       ↓
Weight: ████████
       ↓
Force: ████████████  ← Same force!
       ↓
Weight: ████
       ↓
Force: ████████████  ← Still same force!
       ↓
Weight: ██
       ↓
Force: ████████████  ← SAME FORCE!
       ↓
Weight: 0            ← HIT THE WALL!

L1's constant gradient pushes weights all the way to zero.

L2's diminishing gradient lets weights approach zero asymptotically but never reach it.
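
Here's a toy simulation of that intuition, using the penalty term alone on a single weight. It's not how sklearn's solvers actually work (Lasso uses coordinate descent), but the soft-thresholding step for L1 captures the constant pull that lands weights at exactly zero:

w_l2, w_l1 = 2.0, 2.0   # start both weights at the same value
lr, lam = 0.1, 1.0      # step size and regularization strength

for _ in range(100):
    # L2 update: gradient is 2*λ*w, so the pull weakens as w shrinks
    w_l2 -= lr * 2 * lam * w_l2
    # L1 update (soft-thresholding): a constant pull of lr*λ,
    # and the weight snaps to exactly zero once it gets small enough
    sign = 1.0 if w_l1 >= 0 else -1.0
    w_l1 = sign * max(abs(w_l1) - lr * lam, 0.0)

print(f"L2 weight: {w_l2:.2e}  (tiny, but never exactly zero)")
print(f"L1 weight: {w_l1}  (exactly zero)")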


When to Use L2 (Ridge)

Perfect for:

1. When all features probably matter

If you believe every feature contains some signal, L2 keeps them all but reduces their impact.

# Example: Predicting house prices
# sqft, bedrooms, bathrooms, location, age, condition...
# ALL of these probably matter somewhat

from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)

2. When features are correlated

If you have features that move together (multicollinearity), L2 spreads the weight among them instead of picking one arbitrarily.

# Example: height_cm and height_inches are perfectly correlated
# L2 gives both some weight
# L1 might randomly zero out one of them

3. When you want stable predictions

L2 creates smoother models. Small changes in input create small changes in output.

4. When you have more features than samples (p > n) but believe in shared signal

L2 won't fail — it will share the weight across correlated features.


L2 in Code

# Sklearn
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)  # alpha = λ
model.fit(X_train, y_train)

# Neural Network (Keras)
from tensorflow.keras.regularizers import l2
from tensorflow.keras.layers import Dense

model.add(Dense(64, kernel_regularizer=l2(0.01)))

# PyTorch
optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.01)
# weight_decay IS L2 regularization

When to Use L1 (Lasso)

Perfect for:

1. When you suspect many features are useless

If you have 100 features but suspect only 10 matter, L1 will find them.

# Example: Gene expression data
# 20,000 genes, but maybe only 50 matter for this disease
# L1 finds the 50

from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1)

2. When you need interpretability

Fewer features = easier to explain.

# Example: "The model uses 5 features: age, income, credit_score..."
# Much easier than "The model uses 500 features with tiny weights..."

3. When you want automatic feature selection

L1 is feature selection built into training. No separate step needed.

# Instead of:
# 1. Train model
# 2. Analyze feature importance
# 3. Remove unimportant features
# 4. Retrain

# Just do:
model = Lasso(alpha=0.1)
model.fit(X, y)
# Done! Useless features already have weight = 0

4. When you need a sparse model for production

Fewer features = faster predictions = smaller model.

# Features with weight = 0 don't need to be computed at inference time
# Huge speedup for high-dimensional data
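
One way to act on that sparsity is sklearn's SelectFromModel, which keeps only the columns whose Lasso coefficient is non-zero. A sketch, assuming X_train and y_train are already defined:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

selector = SelectFromModel(Lasso(alpha=0.1))     # keeps features with non-zero weights
X_reduced = selector.fit_transform(X_train, y_train)
print(f"Kept {X_reduced.shape[1]} of {X_train.shape[1]} features")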

L1 in Code

# Sklearn
import numpy as np
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1)  # alpha = λ
model.fit(X_train, y_train)

# Count non-zero features
n_features_used = np.sum(model.coef_ != 0)
print(f"Model uses {n_features_used} out of {len(model.coef_)} features")

# Neural Network (Keras)
from tensorflow.keras.regularizers import l1
from tensorflow.keras.layers import Dense

model.add(Dense(64, kernel_regularizer=l1(0.01)))

# PyTorch (requires manual implementation)
l1_lambda = 0.01
l1_norm = sum(p.abs().sum() for p in model.parameters())
loss = loss_fn(output, target) + l1_lambda * l1_norm

Elastic Net: The Best of Both Worlds

What if you want some feature selection (L1) AND some weight smoothing (L2)?

Elastic Net combines both:

Penalty = λ₁ × Σ|weights| + λ₂ × Σ(weights²)
          ↑                   ↑
         L1 part            L2 part

You control the mix with the l1_ratio parameter:

  • l1_ratio = 1.0 → Pure L1 (Lasso)
  • l1_ratio = 0.0 → Pure L2 (Ridge)
  • l1_ratio = 0.5 → Half and half

from sklearn.linear_model import ElasticNet

# More L1 (more sparsity)
model = ElasticNet(alpha=0.1, l1_ratio=0.8)

# More L2 (more smoothing)
model = ElasticNet(alpha=0.1, l1_ratio=0.2)

# Balanced
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
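
To see how the mix affects sparsity, here's a small sweep over l1_ratio (a sketch; the exact counts depend on your data and alpha, but you should generally see more coefficients hit zero as l1_ratio grows):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10, random_state=42)

for ratio in [0.2, 0.5, 0.8, 1.0]:
    model = ElasticNet(alpha=1.0, l1_ratio=ratio).fit(X, y)
    print(f"l1_ratio={ratio}: {np.sum(model.coef_ != 0)} non-zero features")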

When to Use Elastic Net

  • Many features, some useless, some correlated
  • L1 alone is too aggressive (eliminates too much)
  • L2 alone keeps too many features
  • You want some sparsity but also stability

The Decision Flowchart

Start
  │
  ▼
Do you suspect many features are useless?
  │
  ├─ YES → Do you have correlated features?
  │          │
  │          ├─ YES → Use ELASTIC NET
  │          │
  │          └─ NO → Use L1 (LASSO)
  │
  └─ NO → Do you believe all features contribute?
           │
           ├─ YES → Use L2 (RIDGE)
           │
           └─ UNSURE → Use ELASTIC NET

Real-World Scenarios

Scenario 1: Spam Detection with 10,000 Word Features

You're classifying emails. You extract 10,000 word features (TF-IDF).

Most words don't matter for spam. "Viagra" matters. "Meeting" doesn't.

Use: L1 (Lasso)

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(penalty='l1', solver='saga', C=0.1)
model.fit(X_train, y_train)

# See which words matter (`words` here is your TF-IDF vocabulary,
# e.g. vectorizer.get_feature_names_out())
important_words = [word for word, coef in zip(words, model.coef_[0]) if coef != 0]
print(f"Spam indicators: {important_words[:10]}")

Scenario 2: House Price Prediction with 20 Features

You're predicting house prices. Features: sqft, bedrooms, bathrooms, location, age, school rating, etc.

All of these probably matter somewhat.

Use: L2 (Ridge)

from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

Scenario 3: Gene Expression with 20,000 Genes

You're predicting disease outcome from gene expression.

Only a small subset of genes matter. But genes are often correlated (work in pathways).

Use: Elastic Net

from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=0.1, l1_ratio=0.7)  # Mostly L1, some L2
model.fit(X_train, y_train)

# Find important genes (`genes` here is the list of gene names for the columns of X_train)
important_genes = [gene for gene, coef in zip(genes, model.coef_) if coef != 0]

Scenario 4: Deep Neural Network

You're training a CNN for image classification.

Neural networks have millions of weights. True sparsity is less important than preventing overfitting.

Use: L2 (Weight Decay)

# PyTorch
optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.01)

# Keras
from tensorflow.keras.regularizers import l2
from tensorflow.keras.layers import Conv2D
model.add(Conv2D(32, (3, 3), kernel_regularizer=l2(0.01)))

(Dropout is often more effective than L1/L2 for neural networks)
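
For reference, adding dropout in Keras is a one-liner (a fragment in the same style as the snippets above, assuming a Sequential model is being built):

from tensorflow.keras.layers import Dense, Dropout

model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))   # randomly zeroes 50% of activations during training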


The Comparison Table

Aspect                L1 (Lasso)                       L2 (Ridge)
Penalty               Sum of |weights|                 Sum of squared weights
Effect on weights     Some → exactly 0                 All → smaller
Feature selection     Built-in                         No
Correlated features   Picks one randomly               Spreads weight
Interpretability      Higher (fewer features)          Lower (all features)
Sparsity              Yes                              No
Computation           Can be slower                    Generally faster
Default choice        When features might be useless   When all features matter

Visualizing the Effect on Model Complexity

                    Regularization Strength (λ)
                    Low ──────────────────► High

L2 (Ridge):
Features:     [●][●][●][●][●]   →   [•][•][•][•][•]
               All present           All present
               Full strength         All weakened

L1 (Lasso):
Features:     [●][●][●][●][●]   →   [●][ ][ ][●][ ]
               All present           Some eliminated
               Full strength         Survivors strong

The Suitcase Analogy Revisited

Approach          Suitcase Strategy                      Model Strategy
L1 (Minimalist)   Leave items behind entirely            Zero out weights completely
L2 (Diplomat)     Take everything, compress each item    Shrink all weights
Elastic Net       Mix of both                            Some zeros, some shrunk

Common Mistakes

Mistake 1: Using L1 When Features Are Correlated

Problem:  Features A and B are correlated (both important)
L1 does:  Randomly zeros one, keeps the other
You want: Both to be included with shared weight

Solution: Use L2 or Elastic Net
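
You can see this effect with a deliberately duplicated feature (a toy sketch; the data here is made up):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])                         # two perfectly correlated copies
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=200)

print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)   # weight split roughly evenly
print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)   # tends to zero out one copy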

Mistake 2: Using L2 When You Need Interpretability

Problem:  Your model has 1000 features with tiny weights
Stakeholder: "Which features matter?"
You: "Uh... all of them... a little?"

Solution: Use L1 to identify key features

Mistake 3: Not Scaling Features Before Regularization

# WRONG: Features on different scales
# Large features get penalized more!
model.fit(X_raw, y)  # X has features from 0-1 and 0-1000000

# RIGHT: Scale first
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)
model.fit(X_scaled, y)
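
In practice it's convenient (and avoids leaking test-set statistics) to bundle the scaler and the model into a pipeline. A sketch, assuming X_train and y_train exist:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X_train, y_train)   # the scaler is fitted on the training data only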

Mistake 4: Using the Same λ for L1 and L2

L1 is more aggressive. The same alpha value will have different effects.

# These are NOT equivalent:
ridge = Ridge(alpha=1.0)    # Moderate shrinkage
lasso = Lasso(alpha=1.0)    # Aggressive elimination

# You might need:
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1)    # Much smaller alpha for similar effect

Finding the Right Alpha (λ)

Use cross-validation:

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

# L2: Find best alpha automatically
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(X_train, y_train)
print(f"Best Ridge alpha: {ridge_cv.alpha_}")

# L1: Find best alpha automatically
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5)
lasso_cv.fit(X_train, y_train)
print(f"Best Lasso alpha: {lasso_cv.alpha_}")
print(f"Features used: {np.sum(lasso_cv.coef_ != 0)} / {len(lasso_cv.coef_)}")
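
If you're leaning toward Elastic Net, ElasticNetCV can tune alpha and l1_ratio together (a sketch, assuming X_train and y_train exist):

from sklearn.linear_model import ElasticNetCV

enet_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 1.0], cv=5)
enet_cv.fit(X_train, y_train)
print(f"Best alpha: {enet_cv.alpha_}, best l1_ratio: {enet_cv.l1_ratio_}")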

Key Takeaways

  1. L1 (Lasso) = Zeros out weights → Feature selection → Sparse models

  2. L2 (Ridge) = Shrinks all weights → Keeps all features → Smooth models

  3. L1 is the Minimalist — Eliminates what you don't need

  4. L2 is the Diplomat — Keeps everything but turns down the volume

  5. Use L1 when many features are probably useless

  6. Use L2 when all features probably contribute

  7. Use Elastic Net when unsure or features are correlated

  8. Always scale features before regularization


The One-Sentence Summary

L1 asks "Do I need this?" and throws away the "no"s. L2 asks "How much do I need this?" and turns everything down proportionally.


What's Next?

Now that you understand L1 vs L2, you're ready for:

  • Feature Selection Methods — Beyond L1
  • Cross-Validation — Finding optimal regularization strength
  • Elastic Net Deep Dive — Combining L1 and L2 optimally
  • Regularization in Neural Networks — Dropout, weight decay, and more

Follow me for the next article in this series!


Let's Connect!

If this finally made L1 vs L2 click, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

Which do you use more often, L1 or L2? I'm curious!


The difference between a model with 1000 confusing features and one with 10 clear features? Often just switching from L2 to L1. Know your regularizers.


Share this with someone who's confused about when to use Ridge vs Lasso. The suitcase analogy might be exactly what they need.

Happy learning!
