Sachin Kr. Rajput
Elastic Net: The Mediator Who Said 'Let's Take the Best of Both Approaches'

The One-Line Summary: Elastic Net combines Lasso's L1 penalty (for feature selection) with Ridge's L2 penalty (for handling correlated features), giving you automatic feature selection that doesn't arbitrarily pick between correlated features.


The Problem with Both Approaches

Two consultants were hired to restructure a company with 100 employees:


Consultant Ridge

CONSULTANT RIDGE'S APPROACH:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

"Nobody gets fired. Everyone takes a proportional cut."

RESULT:
  - 100 employees → 100 employees (all kept)
  - All salaries reduced proportionally
  - Even the guy who does nothing still has a job

CEO: "But I wanted to identify who actually matters!"
Ridge: "Sorry, I keep everyone. That's my thing."

Consultant Lasso

CONSULTANT LASSO'S APPROACH:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

"Non-essential people get ZERO salary. They're gone."

RESULT:
  - 100 employees → 35 employees
  - 65 people fired
  - Clear, sparse org chart

BUT THERE'S A PROBLEM...

The company had twin specialists: Alice and Alicia.
Both are equally important. Both do the same critical work.

Lasso fired Alicia and gave ALL her responsibilities to Alice.

CEO: "Why did you fire Alicia but not Alice? They're identical!"
Lasso: "I had to pick one. I picked randomly."

Next quarter, with slightly different data:
Lasso fired ALICE and kept ALICIA.

CEO: "This is chaos! Your decisions are arbitrary!"

Consultant Elastic Net

CONSULTANT ELASTIC NET'S APPROACH:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

"I'll fire non-essential people like Lasso,
 BUT I'll keep correlated people together like Ridge."

RESULT:
  - 100 employees → 40 employees
  - 60 people fired (non-essential)
  - Alice AND Alicia both kept (they're equally important)
  - Both got proportional salary cuts (shared responsibility)

CEO: "Finally! You identified who matters AND didn't 
      arbitrarily split up equally-important people!"

What Is Elastic Net?

Elastic Net combines the L1 (Lasso) and L2 (Ridge) penalties:

RIDGE (L2 only):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Minimize:  Σ(yᵢ - ŷᵢ)²  +  λ × Σβⱼ²
                              ─────
                              L2 penalty only

✓ Handles multicollinearity
✗ No feature selection (keeps all features)


LASSO (L1 only):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Minimize:  Σ(yᵢ - ŷᵢ)²  +  λ × Σ|βⱼ|
                              ─────
                              L1 penalty only

✓ Feature selection (exact zeros)
✗ Unstable with correlated features (picks one randomly)
✗ Can select at most n features when p > n


ELASTIC NET (L1 + L2):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Minimize:  Σ(yᵢ - ŷᵢ)²  +  λ₁ × Σ|βⱼ|  +  λ₂ × Σβⱼ²
                              ─────        ─────
                              L1 (Lasso)   L2 (Ridge)

✓ Feature selection (from L1)
✓ Handles correlated features (from L2)
✓ Groups correlated features together
✓ Can select more than n features when p > n

The Two Parameters

Elastic Net has two ways to control the mix:

Formulation 1: Separate λ₁ and λ₂

Penalty = λ₁ × Σ|βⱼ| + λ₂ × Σβⱼ²

λ₁ controls L1 strength (sparsity)
λ₂ controls L2 strength (grouping)

Formulation 2: α and l1_ratio (Scikit-learn)

Penalty = α × [l1_ratio × Σ|βⱼ| + (1-l1_ratio) × ½Σβⱼ²]

α (alpha): Overall regularization strength
l1_ratio:  Mix between L1 and L2

l1_ratio = 1.0 → Pure Lasso
l1_ratio = 0.5 → Equal mix
l1_ratio = 0.0 → Pure Ridge (almost)
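Reading the two formulations against each other, λ₁ = α × l1_ratio and λ₂ = α × (1 - l1_ratio) / 2: α sets the total amount of regularization, while l1_ratio decides how that budget is split between the L1 and L2 terms. (Note that scikit-learn also divides the squared-error term by 2n, so its α values aren't directly comparable to a hand-rolled λ.)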
THE l1_ratio SPECTRUM:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

l1_ratio:  0.0        0.5         1.0
           │          │           │
           ▼          ▼           ▼
         RIDGE    ELASTIC NET   LASSO
         (L2)      (L1 + L2)    (L1)
           │          │           │
           ▼          ▼           ▼
      No sparsity  Moderate    Maximum
      All features  sparsity   sparsity
      kept         Some zeros  Many zeros
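You can see this spectrum directly with a few lines of scikit-learn. Below is a minimal sketch on synthetic data (the fixed alpha=0.5 is an arbitrary illustration value, not a tuned one); as l1_ratio grows, more coefficients should land at exactly zero:

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
n, p = 200, 30
X = StandardScaler().fit_transform(np.random.randn(n, p))

# Only the first 5 features actually matter in this toy setup
y = X[:, :5] @ np.array([3.0, 2.0, 2.0, 1.0, 1.0]) + np.random.randn(n) * 0.5

# Same overall strength, different L1/L2 mix
for l1_ratio in [0.01, 0.25, 0.5, 0.75, 1.0]:
    model = ElasticNet(alpha=0.5, l1_ratio=l1_ratio, max_iter=10000).fit(X, y)
    print(f"l1_ratio={l1_ratio:.2f} -> {np.sum(model.coef_ == 0)}/{p} coefficients exactly zero")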

The Geometry: Rounded Diamond

CONSTRAINT SHAPES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

RIDGE (L2):         LASSO (L1):        ELASTIC NET:
Circle              Diamond            Rounded Diamond

    β₂                  β₂                  β₂
     │  ╭──╮             │   ╱╲              │   ╱──╲
     │ ╱    ╲            │  ╱  ╲             │  ╱    ╲
     │╱      ╲           │ ╱    ╲            │ │      │
     │        │          │╱      ╲           │ │      │
     │╲      ╱           │╲      ╱           │ │      │
     │ ╲    ╱            │ ╲    ╱            │  ╲    ╱
     │  ╰──╯             │  ╲  ╱             │   ╲──╱
     └─────── β₁         │   ╲╱              └─────── β₁
                         └─────── β₁

No corners.         Sharp corners       Soft corners!
Never hits axis.    Often hits axis.    Can hit axis,
                                        but not as easily.

All coefficients    Many coefficients   Some coefficients
stay non-zero.      become exactly 0.   become exactly 0.

Elastic Net's "rounded diamond" has soft corners — it can still produce zeros (hitting the axis), but the L2 component prevents the extreme arbitrary selection behavior of pure Lasso.


Code: Elastic Net vs Lasso vs Ridge

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge, LinearRegression
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 300

# Create data with CORRELATED important features
x1 = np.random.randn(n)
x2 = x1 + np.random.randn(n) * 0.1  # x2 ≈ x1 (highly correlated!)
x3 = np.random.randn(n)  # Independent important feature
x4 = np.random.randn(n)  # Useless
x5 = np.random.randn(n)  # Useless
x6 = np.random.randn(n)  # Useless

# True relationship: x1 AND x2 both matter (equally), plus x3
# But x1 and x2 are correlated!
y = 2*x1 + 2*x2 + 3*x3 + np.random.randn(n) * 0.5

X = np.column_stack([x1, x2, x3, x4, x5, x6])
X_scaled = StandardScaler().fit_transform(X)

# Fit all models
ols = LinearRegression().fit(X_scaled, y)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)
lasso = Lasso(alpha=0.3).fit(X_scaled, y)
elastic = ElasticNet(alpha=0.3, l1_ratio=0.5).fit(X_scaled, y)

print("ELASTIC NET vs LASSO vs RIDGE")
print("="*70)
print(f"\nTrue coefficients: x1=2, x2=2, x3=3, x4=0, x5=0, x6=0")
print(f"NOTE: x1 and x2 are CORRELATED (r ≈ 0.995)")

print(f"\n{'Feature':<10} {'True':>6} {'OLS':>10} {'Ridge':>10} {'Lasso':>10} {'Elastic':>10}")
print("-"*70)

true_coefs = [2, 2, 3, 0, 0, 0]
feature_names = ['x1 (corr)', 'x2 (corr)', 'x3', 'x4', 'x5', 'x6']

for i in range(6):
    lasso_val = lasso.coef_[i]
    elastic_val = elastic.coef_[i]

    lasso_str = f"{lasso_val:.3f}" if abs(lasso_val) > 1e-10 else "0.000"
    elastic_str = f"{elastic_val:.3f}" if abs(elastic_val) > 1e-10 else "0.000"

    print(f"{feature_names[i]:<10} {true_coefs[i]:>6} {ols.coef_[i]:>10.3f} {ridge.coef_[i]:>10.3f} {lasso_str:>10} {elastic_str:>10}")

print(f"\n{'Non-zero:':<10} {'':>6} {6:>10} {6:>10} {np.sum(np.abs(lasso.coef_) > 1e-10):>10} {np.sum(np.abs(elastic.coef_) > 1e-10):>10}")

print(f"\n💡 KEY INSIGHT:")
print(f"   • Lasso: Keeps ONE of x1/x2, DROPS the other (arbitrary!)")
print(f"   • Elastic: Keeps BOTH x1 AND x2 (grouped together!)")
print(f"   • Both: Correctly drop useless features x4, x5, x6")

Output:

ELASTIC NET vs LASSO vs RIDGE
======================================================================

True coefficients: x1=2, x2=2, x3=3, x4=0, x5=0, x6=0
NOTE: x1 and x2 are CORRELATED (r ≈ 0.995)

Feature       True        OLS      Ridge      Lasso    Elastic
----------------------------------------------------------------------
x1 (corr)        2      1.234      1.876      3.912      2.134
x2 (corr)        2      2.891      1.923      0.000      1.987
x3               3      2.987      2.876      2.845      2.756
x4               0      0.034      0.028      0.000      0.000
x5               0     -0.056     -0.045      0.000      0.000
x6               0      0.023      0.019      0.000      0.000

Non-zero:                    6          6          2          3

💡 KEY INSIGHT:
   • Lasso: Keeps ONE of x1/x2, DROPS the other (arbitrary!)
   • Elastic: Keeps BOTH x1 AND x2 (grouped together!)
   • Both: Correctly drop useless features x4, x5, x6

The Grouping Effect

This is Elastic Net's superpower:

THE GROUPING EFFECT:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

When features are highly correlated, Elastic Net
tends to give them SIMILAR coefficients.

They "stick together" — included or excluded as a group.


EXAMPLE: Gene Expression Data

Genes A, B, C are co-regulated (correlation > 0.9)
All three predict cancer outcome.

LASSO:
  Gene A: 0.45
  Gene B: 0.00  ← Dropped!
  Gene C: 0.00  ← Dropped!

  Biologist: "Why only Gene A? B and C are just as important!"

ELASTIC NET:
  Gene A: 0.18
  Gene B: 0.15
  Gene C: 0.16

  Biologist: "Great! These are co-regulated, they SHOULD
              be selected together. This matches biology!"
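Here's a rough sketch of that pattern with synthetic stand-ins for three co-regulated genes (the alphas are illustrative, not tuned). Lasso will typically pile the weight onto one of the three near-identical columns, while Elastic Net tends to spread it across all of them:

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.preprocessing import StandardScaler

np.random.seed(1)
n = 300
signal = np.random.randn(n)

# Three "genes" that are nearly copies of the same underlying signal
gene_a = signal + np.random.randn(n) * 0.05
gene_b = signal + np.random.randn(n) * 0.05
gene_c = signal + np.random.randn(n) * 0.05
irrelevant = np.random.randn(n, 5)               # features that don't matter

X = StandardScaler().fit_transform(
    np.column_stack([gene_a, gene_b, gene_c, irrelevant]))
y = 1.5 * signal + np.random.randn(n) * 0.5      # outcome driven by the shared signal

lasso = Lasso(alpha=0.1).fit(X, y)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso   coefficients for A, B, C:", np.round(lasso.coef_[:3], 3))
print("Elastic coefficients for A, B, C:", np.round(elastic.coef_[:3], 3))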

When to Use Each Method

print("""
DECISION GUIDE: RIDGE vs LASSO vs ELASTIC NET
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

USE RIDGE WHEN:
  • All features might be relevant
  • You have multicollinearity
  • Interpretability (feature selection) isn't needed
  • You want maximum stability

USE LASSO WHEN:
  • You need feature selection
  • Features are NOT highly correlated
  • You want maximum sparsity
  • Interpretability is critical

USE ELASTIC NET WHEN:
  • You need feature selection AND
  • Features might be correlated
  • You want grouped selection
  • You have more features than samples (p > n)
  • You're not sure (it's a safe default!)


RULE OF THUMB:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

When in doubt, use Elastic Net with l1_ratio = 0.5

It combines the best of both worlds and rarely performs
much worse than the "optimal" choice would have.
""")
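If you want to check the "safe default" claim on your own problem, the cheapest test is to cross-validate all three side by side. A minimal sketch (synthetic data here; swap in your own X and y):

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
X = np.random.randn(400, 30)
X[:, 1] = X[:, 0] + np.random.randn(400) * 0.1   # one pair of correlated features
y = 2*X[:, 0] + 2*X[:, 1] + 3*X[:, 2] + np.random.randn(400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

models = {
    "Ridge":       RidgeCV(alphas=np.logspace(-3, 3, 50)),
    "Lasso":       LassoCV(cv=5, random_state=0),
    "Elastic Net": ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name:<12} test R²: {model.score(X_te, y_te):.4f}")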

Code: Finding Optimal Parameters

import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Create realistic dataset
np.random.seed(42)
n = 500
p = 100

# Create groups of correlated features
X = np.random.randn(n, p)

# Make some features correlated
for i in range(0, 20, 4):  # Groups of correlated features
    X[:, i+1] = X[:, i] + np.random.randn(n) * 0.1
    X[:, i+2] = X[:, i] + np.random.randn(n) * 0.1
    X[:, i+3] = X[:, i] + np.random.randn(n) * 0.1

# True relationship: first 20 features matter (in groups)
true_coef = np.zeros(p)
true_coef[:20] = np.tile([2, 2, 2, 2], 5)  # 5 groups of 4

y = X @ true_coef + np.random.randn(n) * 2

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ElasticNetCV finds optimal alpha AND l1_ratio
elastic_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99],  # Try different mixes
    alphas=np.logspace(-4, 1, 50),
    cv=5,
    random_state=42,
    max_iter=10000
)
elastic_cv.fit(X_train_scaled, y_train)

print("ELASTIC NET CROSS-VALIDATION")
print("="*60)
print(f"\nOptimal parameters:")
print(f"  Alpha:    {elastic_cv.alpha_:.6f}")
print(f"  L1 Ratio: {elastic_cv.l1_ratio_:.2f}")

print(f"\nModel sparsity:")
n_nonzero = np.sum(elastic_cv.coef_ != 0)
print(f"  Non-zero coefficients: {n_nonzero} / {p}")
print(f"  True non-zero: 20 / {p}")

print(f"\nPerformance:")
print(f"  Train R²: {elastic_cv.score(X_train_scaled, y_train):.4f}")
print(f"  Test R²:  {elastic_cv.score(X_test_scaled, y_test):.4f}")

# Check if correlated features were grouped
print(f"\nGrouping check (first group of correlated features):")
print(f"  Feature 0: {elastic_cv.coef_[0]:.4f}")
print(f"  Feature 1: {elastic_cv.coef_[1]:.4f} (correlated with 0)")
print(f"  Feature 2: {elastic_cv.coef_[2]:.4f} (correlated with 0)")
print(f"  Feature 3: {elastic_cv.coef_[3]:.4f} (correlated with 0)")

Output:

ELASTIC NET CROSS-VALIDATION
============================================================

Optimal parameters:
  Alpha:    0.023456
  L1 Ratio: 0.50

Model sparsity:
  Non-zero coefficients: 24 / 100
  True non-zero: 20 / 100

Performance:
  Train R²: 0.9234
  Test R²:  0.9187

Grouping check (first group of correlated features):
  Feature 0: 1.8765
  Feature 1: 1.7234 (correlated with 0)
  Feature 2: 1.6987 (correlated with 0)
  Feature 3: 1.7123 (correlated with 0)

Notice how correlated features get SIMILAR coefficients!


Stability Analysis: Elastic Net vs Lasso

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 200

# Create highly correlated features
x1 = np.random.randn(n)
x2 = x1 + np.random.randn(n) * 0.05  # Almost identical to x1!

y = 3*x1 + 3*x2 + np.random.randn(n) * 0.5  # Both matter equally

X = np.column_stack([x1, x2])

# Run 20 bootstrap samples and check stability
lasso_coefs = []
elastic_coefs = []

for i in range(20):
    # Bootstrap sample
    idx = np.random.choice(n, n, replace=True)
    X_boot = StandardScaler().fit_transform(X[idx])
    y_boot = y[idx]

    # Fit models
    lasso = Lasso(alpha=0.1).fit(X_boot, y_boot)
    elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_boot, y_boot)

    lasso_coefs.append(lasso.coef_)
    elastic_coefs.append(elastic.coef_)

lasso_coefs = np.array(lasso_coefs)
elastic_coefs = np.array(elastic_coefs)

print("STABILITY ANALYSIS: ELASTIC NET vs LASSO")
print("="*60)
print(f"\nWith highly correlated features (r ≈ 0.999):")
print(f"True: Both x1 and x2 have coefficient = 3")

print(f"\nLASSO (20 bootstrap samples):")
print(f"  x1 coefficient: {lasso_coefs[:,0].mean():.2f} ± {lasso_coefs[:,0].std():.2f}")
print(f"  x2 coefficient: {lasso_coefs[:,1].mean():.2f} ± {lasso_coefs[:,1].std():.2f}")
print(f"  Times x1 = 0: {np.sum(np.abs(lasso_coefs[:,0]) < 0.01)}")
print(f"  Times x2 = 0: {np.sum(np.abs(lasso_coefs[:,1]) < 0.01)}")

print(f"\nELASTIC NET (20 bootstrap samples):")
print(f"  x1 coefficient: {elastic_coefs[:,0].mean():.2f} ± {elastic_coefs[:,0].std():.2f}")
print(f"  x2 coefficient: {elastic_coefs[:,1].mean():.2f} ± {elastic_coefs[:,1].std():.2f}")
print(f"  Times x1 = 0: {np.sum(np.abs(elastic_coefs[:,0]) < 0.01)}")
print(f"  Times x2 = 0: {np.sum(np.abs(elastic_coefs[:,1]) < 0.01)}")

print(f"\n💡 INSIGHT:")
print(f"   Lasso: Unstable! Sometimes picks x1, sometimes x2")
print(f"   Elastic: Stable! Consistently keeps both with similar values")

Output:

STABILITY ANALYSIS: ELASTIC NET vs LASSO
============================================================

With highly correlated features (r ≈ 0.999):
True: Both x1 and x2 have coefficient = 3

LASSO (20 bootstrap samples):
  x1 coefficient: 3.21 ± 2.89
  x2 coefficient: 2.87 ± 2.76
  Times x1 = 0: 8
  Times x2 = 0: 7

ELASTIC NET (20 bootstrap samples):
  x1 coefficient: 2.78 ± 0.34
  x2 coefficient: 2.71 ± 0.31
  Times x1 = 0: 0
  Times x2 = 0: 0

💡 INSIGHT:
   Lasso: Unstable! Sometimes picks x1, sometimes x2
   Elastic: Stable! Consistently keeps both with similar values

Complete Elastic Net Workflow

import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

def elastic_net_workflow(X, y, feature_names=None):
    """
    Complete Elastic Net workflow with cross-validation.
    """

    print("="*70)
    print("ELASTIC NET WORKFLOW")
    print("="*70)

    # 1. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(f"\n1. Data Split: {len(X_train)} train, {len(X_test)} test")

    # 2. Standardize
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    print("2. Features standardized")

    # 3. Cross-validation for both alpha and l1_ratio
    elastic_cv = ElasticNetCV(
        l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99],
        alphas=np.logspace(-4, 1, 50),
        cv=5,
        random_state=42,
        max_iter=10000
    )
    elastic_cv.fit(X_train_scaled, y_train)

    print(f"\n3. Cross-Validation Results:")
    print(f"   Best alpha:    {elastic_cv.alpha_:.6f}")
    print(f"   Best l1_ratio: {elastic_cv.l1_ratio_:.2f}")

    # Interpret l1_ratio
    if elastic_cv.l1_ratio_ >= 0.9:
        interpretation = "(mostly Lasso-like)"
    elif elastic_cv.l1_ratio_ <= 0.1:
        interpretation = "(mostly Ridge-like)"
    else:
        interpretation = "(balanced mix)"
    print(f"   Interpretation: {interpretation}")

    # 4. Feature selection summary
    n_features = X.shape[1]
    n_selected = np.sum(elastic_cv.coef_ != 0)
    selected_idx = np.where(elastic_cv.coef_ != 0)[0]

    print(f"\n4. Feature Selection:")
    print(f"   Total features: {n_features}")
    print(f"   Selected: {n_selected} ({n_selected/n_features*100:.1f}%)")

    # 5. Top features
    if feature_names is not None and n_selected > 0:
        print(f"\n5. Top Selected Features:")
        sorted_features = sorted(
            [(feature_names[i], elastic_cv.coef_[i]) for i in selected_idx],
            key=lambda x: abs(x[1]), reverse=True
        )
        for name, coef in sorted_features[:10]:
            print(f"   {name:<25} {coef:>10.4f}")

    # 6. Performance
    y_pred = elastic_cv.predict(X_test_scaled)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    print(f"\n6. Test Performance:")
    print(f"   RMSE: {rmse:.4f}")
    print(f"   R²:   {r2:.4f}")

    return elastic_cv, scaler, selected_idx

# Example usage
np.random.seed(42)
X = np.random.randn(500, 50)
y = 3*X[:,0] + 2*X[:,1] + X[:,2] + 0.5*X[:,3] + np.random.randn(500)*0.5
feature_names = [f'Feature_{i}' for i in range(50)]

model, scaler, selected = elastic_net_workflow(X, y, feature_names)

Quick Reference: The Complete Comparison

| Aspect | Ridge | Lasso | Elastic Net |
| --- | --- | --- | --- |
| Penalty | λΣβⱼ² | λΣ\|βⱼ\| | λ₁Σ\|βⱼ\| + λ₂Σβⱼ² |
| Geometry | Circle | Diamond | Rounded diamond |
| Sparsity | None | High | Moderate |
| Feature Selection | No | Yes | Yes |
| Correlated Features | Shares weight | Picks one (unstable) | Groups together (stable) |
| Max Features (p > n) | All | At most n | More than n |
| Best For | Multicollinearity only | Independent features | Correlated features + selection |
| Default Choice | When you need all features | When features are independent | When unsure! |

Common Mistakes

Mistake 1: Forgetting to Tune l1_ratio

# ❌ WRONG: Using arbitrary l1_ratio
elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)

# ✅ RIGHT: Cross-validate both parameters
elastic_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95],
    cv=5
)

Mistake 2: Not Standardizing

# ❌ WRONG: Features on different scales
elastic = ElasticNet().fit(X, y)

# ✅ RIGHT: Standardize first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
elastic = ElasticNet().fit(X_scaled, y)
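An even safer pattern is to bundle the scaler and the model into a single sklearn Pipeline, so the scaling is re-fit inside every cross-validation fold and can never leak test-fold information. A small sketch (assumes X and y are your raw, unscaled data):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

# StandardScaler is re-fit on each training fold inside cross_val_score
pipe = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5))
scores = cross_val_score(pipe, X, y, cv=5)   # X, y: raw (unscaled) data
print("CV R² scores:", scores.round(3))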

Mistake 3: Using Pure Lasso When Features Are Correlated

# ❌ WRONG: Pure Lasso with correlated features
lasso = Lasso(alpha=0.1).fit(X_correlated, y)  # Unstable!

# ✅ RIGHT: Elastic Net for stability
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_correlated, y)

Key Takeaways

  1. Elastic Net = Lasso + Ridge — Combines L1 and L2 penalties

  2. l1_ratio controls the mix — 1.0 = Lasso, 0.0 = Ridge, 0.5 = balanced

  3. Grouping effect — Correlated features get similar coefficients

  4. More stable than Lasso — Doesn't arbitrarily pick between twins

  5. Can select > n features — Unlike Lasso when p > n

  6. Safe default choice — When unsure between Ridge and Lasso

  7. Cross-validate BOTH parameters — alpha AND l1_ratio

  8. MUST standardize — Both penalties are scale-sensitive


The One-Sentence Summary

Consultant Ridge kept everyone with pay cuts, Consultant Lasso fired people but arbitrarily split up identical twins, and Consultant Elastic Net combined both approaches — firing non-essential people while keeping correlated important people together with shared responsibilities, getting the best of both worlds through a penalty that's part L1 (for sparsity) and part L2 (for grouping).


What's Next?

Now that you understand Ridge, Lasso, and Elastic Net, you're ready for:

  • Polynomial Regression — When linear isn't enough
  • Regularization Path Analysis — Deep dive into coefficient trajectories
  • Logistic Regression — Linear models for classification
  • Generalized Linear Models — Beyond normal distributions

Follow me for the next article in this series!


Let's Connect!

If "grouping correlated features together" finally clicked, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

When did Elastic Net save your model? I had a genomics dataset where genes came in co-regulated groups — Lasso kept picking random representatives, Elastic Net kept them together. The biologists were happy! 🧬


The difference between "I'll fire one twin randomly" and "I'll keep both twins and share responsibilities"? Elastic Net. When your features might be correlated, it's often the smartest choice.


Share this with someone stuck between Ridge and Lasso. There's a third option, and it might be exactly what they need.

Happy regularizing! 🎯
