
Jenil Sheth

L1 L2 Regularization

Prompt:

Explain L1 and L2 regularization and also derive their derivatives.

Response:

Alright 🙂

I'll explain L1 and L2 regularization with both intuition and maths, and then we'll work out the derivatives as well.


1️⃣ Why do we need regularization?

Machine learning models (especially linear / deep models) run into the problem of overfitting:

the model learns the training data very well, but fails on new data.

👉 The idea behind regularization is:

“Stop the weights from becoming unnecessarily large”

This makes the model simpler and more generalizable.


2️⃣ Base Loss Function (without regularization)

Suppose we have linear regression:

\hat{y} = wx + b

and the loss function (MSE):

J(w) = \frac{1}{2m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2
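As a quick sketch of this base loss and its gradient (a NumPy illustration with made-up variable names, not code from the original post):

```python
import numpy as np

def mse_loss_and_grad(w, b, x, y):
    """Plain MSE loss J = 1/(2m) * sum((y - y_hat)^2) and its gradient w.r.t. w."""
    m = len(x)
    y_hat = w * x + b                      # predictions of the linear model
    loss = np.sum((y - y_hat) ** 2) / (2 * m)
    grad_w = np.sum((y_hat - y) * x) / m   # dJ/dw, no regularization yet
    return loss, grad_w
```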

Now let's add regularization.


🔹 L2 Regularization (Ridge Regression)

🔸 Definition

In L2 regularization we add a penalty on the square of the weights:

J(w) = \frac{1}{2m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2

where:

  • λ = regularization strength
  • Larger λ ⇒ stronger penalty ⇒ smaller weights
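As a small NumPy sketch of this ridge objective (continuing the example above; names like `lam` for λ are illustrative, not from the post):

```python
def ridge_loss(w, b, x, y, lam=0.1):
    """MSE plus the L2 penalty (lambda / 2m) * sum(w^2)."""
    m = len(x)
    y_hat = w * x + b
    mse = np.sum((y - y_hat) ** 2) / (2 * m)
    l2_penalty = (lam / (2 * m)) * np.sum(np.asarray(w) ** 2)  # penalizes large weights
    return mse + l2_penalty
```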

🔸 Intuition (feel it)

  • Large weights ⇒ heavy penalty
  • Weights move close to zero, but never become exactly zero
  • Gives a smooth solution

📌 That is why L2 = weight decay


🔸 Derivative of L2 Regularization

Penalty term:

\frac{\lambda}{2m} w^2

Derivative w.r.t. w:

\frac{d}{dw}\left(\frac{\lambda}{2m} w^2\right) = \frac{\lambda}{m} w

🔸 Final Gradient (L2)

\frac{\partial J}{\partial w} = \frac{1}{m}\sum (\hat{y} - y)\,x + \frac{\lambda}{m}w

👉 Gradient descent update:

w := w - \alpha \left( \frac{1}{m}\sum (\hat{y} - y)\,x + \frac{\lambda}{m}w \right)
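Putting the data gradient and the (λ/m) w penalty gradient together, one update step could look like this (a sketch continuing the NumPy example above; `alpha` and `lam` are illustrative hyperparameters):

```python
def l2_update(w, b, x, y, alpha=0.01, lam=0.1):
    """One gradient-descent step with L2 (ridge) regularization on w."""
    m = len(x)
    y_hat = w * x + b
    grad_w = np.sum((y_hat - y) * x) / m + (lam / m) * w  # data gradient + (lambda/m)*w
    grad_b = np.sum(y_hat - y) / m                        # bias is typically left unregularized
    return w - alpha * grad_w, b - alpha * grad_b
```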

🔹 L1 Regularization (Lasso Regression)

🔸 Definition

In L1 regularization, the penalty is the absolute value of the weights:

J(w) = \frac{1}{2m}\sum (y - \hat{y})^2 + \frac{\lambda}{m}\sum_{j=1}^{n} |w_j|
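The analogous lasso objective as a sketch (same illustrative names as above):

```python
def lasso_loss(w, b, x, y, lam=0.1):
    """MSE plus the L1 penalty (lambda / m) * sum(|w|)."""
    m = len(x)
    y_hat = w * x + b
    mse = np.sum((y - y_hat) ** 2) / (2 * m)
    l1_penalty = (lam / m) * np.sum(np.abs(w))  # absolute-value penalty drives weights to zero
    return mse + l1_penalty
```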

🔸 Intuition

  • Absolute value penalty
  • Pushes weights all the way to exactly zero
  • Does feature selection automatically ✨

📌 That is why L1 = sparse model


🔸 Derivative of L1 Regularization

Penalty term:

\frac{\lambda}{m}|w|

Derivative:

\frac{d}{dw}|w| = \begin{cases} +1 & w > 0 \\ -1 & w < 0 \\ \text{undefined} & w = 0 \end{cases}

So,

\frac{d}{dw}\left(\frac{\lambda}{m}|w|\right) = \frac{\lambda}{m}\,\text{sign}(w)

🔸 Final Gradient (L1)

\frac{\partial J}{\partial w} = \frac{1}{m}\sum (\hat{y} - y)\,x + \frac{\lambda}{m}\,\text{sign}(w)
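And the matching L1 (sub)gradient step as a sketch; note that np.sign returns 0 at w = 0, which is one common way to handle the undefined point:

```python
def l1_update(w, b, x, y, alpha=0.01, lam=0.1):
    """One (sub)gradient-descent step with L1 (lasso) regularization on w."""
    m = len(x)
    y_hat = w * x + b
    grad_w = np.sum((y_hat - y) * x) / m + (lam / m) * np.sign(w)  # sign(w) subgradient
    grad_b = np.sum(y_hat - y) / m
    return w - alpha * grad_w, b - alpha * grad_b
```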

3️⃣ L1 vs L2 (Quick Comparison)

Property           | L1                     | L2
-------------------|------------------------|-------------------------
Penalty            | Sum of abs(w_j)        | Sum of w_j^2
Weights            | Exact zero             | Small but non-zero
Feature selection  | ✅ Yes                 | ❌ No
Stability          | Less                   | More
Use when           | Many useless features  | All features important

4️⃣ Geometry Intuition (Interview gold ⭐)

  • L1 → Diamond-shaped constraint → hits axes → sparse solution
  • L2 → Circular constraint → smooth shrinkage

5️⃣ Deep Learning Context

  • In neural networks (see the sketch below):
    • L2 is the most common (called weight decay)
    • L1 is used when sparsity is needed
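For example, here is a minimal PyTorch sketch (the toy model, data, and hyperparameter values are made up for illustration): L2 is usually applied through the optimizer's weight_decay argument, while L1 is typically added to the loss by hand.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()

# L2 / weight decay: handled directly by the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(32, 10), torch.randn(32, 1)
pred = model(x)

# L1: add lambda * sum(|w|) to the loss manually
lam = 1e-3
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = criterion(pred, y) + lam * l1_penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()
```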

If you want, I can:

  • derive L1/L2 in vector form
  • explain Elastic Net (L1 + L2)
  • or walk through it with PyTorch / TensorFlow code 🔥

Tell me what you'd like next 😊

