Logistic Regression: Beyond the Line - Classifying the World 0️⃣/1️⃣ ✨

Hey there! 👋 Randhir here, the guy behind *TailorMails.dev* (my personalized cold email tool built with AI!). As I dive deeper into ethical hacking, machine learning, and web development, understanding core algorithms like Logistic Regression is essential. It's how we teach machines to make decisions!

In Supervised Learning, we train models to predict an "output" (or "target") variable y based on "input" features x. When y can only take on a small number of discrete values (like 'house' or 'apartment', or simply '0' or '1'), we're talking Classification problems.

Today, let's explore Logistic Regression, a fundamental algorithm specifically for binary classification (where y is typically 0 or 1).


Why Linear Regression Fails Here 🚫

You might think, "Why not just use Linear Regression?" Good question! Standard Linear Regression approximates y with:

    h_\theta(x) = \theta^T x

  • The Problem: If y must be 0 or 1, it makes no sense for our model to output values like 5 or -2. Linear Regression can easily do that! We need an output bounded between 0 and 1 to represent probabilities (a tiny numeric illustration follows below).
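To see the issue concretely, here's a tiny numeric illustration; the parameter and feature values are made up purely for demonstration.

```python
import numpy as np

theta = np.array([0.5, 2.0])   # illustrative parameters (not fitted to anything)
x = np.array([1.0, 3.0])       # x[0] = 1 plays the role of the intercept term

h = np.dot(theta, x)           # linear hypothesis h_theta(x) = theta^T x
print(h)                       # 6.5, not a valid probability for y in {0, 1}
```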

The Logistic Regression Hypothesis: The Sigmoid Solution ✅

To fix this, Logistic Regression introduces a special function to its hypothesis:

  • It uses the logistic function (also known as the sigmoid function):
    g(z) = \frac{1}{1 + e^{-z}}

  • This transforms the linear combination of inputs:
    h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}

  • Key Benefit: The sigmoid function ensures that h_\theta(x) is always between 0 and 1, making it perfect to interpret as a probability, e.g., P(y = 1 | x; \theta). This choice isn't arbitrary; it's "fairly natural" due to its ties with Generalized Linear Models (GLMs). A minimal code sketch of this hypothesis follows below.
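To make this concrete, here's a minimal NumPy sketch of the sigmoid and the logistic hypothesis. The function names (`sigmoid`, `predict_proba`) and the example numbers are my own illustrative choices.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x), read as P(y = 1 | x; theta)."""
    return sigmoid(np.dot(theta, x))

theta = np.array([0.5, -1.2, 0.3])   # illustrative parameters
x = np.array([1.0, 2.0, -0.5])       # x[0] = 1 acts as the intercept term
print(predict_proba(theta, x))       # always strictly between 0 and 1
```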


Probabilistic Interpretation & MLE: The "Why" Behind the Model 🧠📊

Just like least-squares regression, Logistic Regression has a strong probabilistic foundation. It's derived as a maximum likelihood estimator under specific assumptions:

  • Core Assumptions:

    • Probability of y = 1: P(y = 1 | x; \theta) = h_\theta(x)
    • Probability of y = 0: P(y = 0 | x; \theta) = 1 - h_\theta(x)
    • Compact Form: These can be written beautifully as: p(y | x; \theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1-y}
  • Likelihood Function L(\theta): Assuming training examples are independent, the likelihood for the whole dataset is:

    L(\theta) = \prod_{i=1}^n p(y^{(i)} | x^{(i)}; \theta) = \prod_{i=1}^n (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}


  • Log-Likelihood \ell(\theta): For easier computation, we maximize the log-likelihood:

    \ell(\theta) = \log L(\theta) = \sum_{i=1}^n \left[ y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log(1 - h(x^{(i)})) \right]

    • Goal: We choose \theta to maximize \ell(\theta). A small numeric sketch of this quantity follows right below.
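As a quick sanity check, here's how \ell(\theta) could be computed on a toy dataset; the arrays and helper names are illustrative assumptions, not anything from a real problem.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data: first column of X is all ones (intercept), y holds 0/1 labels.
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1, 0, 1])

print(log_likelihood(np.zeros(2), X, y))  # equals 3 * log(0.5) when theta is all zeros
```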

Parameter Learning: Gradient Ascent (and Its Cousin!) 🚀

To maximize \ell(\theta), we use gradient ascent (it's "ascent" because we're maximizing, not minimizing).

  • Stochastic Gradient Ascent Update Rule: For a single example (x, y) (see the training-loop sketch after this list):
    \theta_j := \theta_j + \alpha (y - h_\theta(x)) x_j

    • Surprise! This looks identical to the LMS (Least Mean Squares) update rule for Linear Regression!
    • Key Difference: In Logistic Regression, h_\theta(x) is a non-linear function of \theta^T x, making the algorithms distinct despite the similar update form. This similarity hints at a "deeper reason" (GLMs!).
  • Faster Option: For maximizing \ell(\theta), Newton's method (or Newton-Raphson) often converges faster. When applied here, it's also known as Fisher scoring.
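Here's a minimal sketch of stochastic gradient ascent using that update rule, reusing the toy data and sigmoid from the earlier snippets; the learning rate and epoch count are arbitrary choices for illustration, not tuned values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, alpha=0.1, epochs=200, seed=0):
    """Stochastic gradient ascent: theta_j += alpha * (y - h_theta(x)) * x_j."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):          # visit examples in random order
            error = y[i] - sigmoid(X[i] @ theta)   # (y - h_theta(x)) for one example
            theta += alpha * error * X[i]          # update all theta_j at once
    return theta

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])  # first column = intercept
y = np.array([1, 0, 1])

theta = sgd_logistic(X, y)
print(theta, sigmoid(X @ theta))  # learned parameters and predicted probabilities
```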


Logistic Regression as a GLM: The Grand Unified Theory 🌐

The "naturalness" of the sigmoid function and the connection between Linear and Logistic Regression become crystal clear within the framework of Generalized Linear Models (GLMs). Both are simply special cases!

GLMs are built on three elegant assumptions:

  1. Exponential Family Distribution: The distribution of y | x; \theta belongs to the Exponential Family. For binary classification, the Bernoulli distribution is the perfect fit.

    • When written in exponential family form, its natural parameter \eta is related to its mean \phi (which is P(y = 1)) by: \eta = \log(\phi / (1 - \phi))
    • Inverting this gives us: \phi = 1 / (1 + e^{-\eta}) ...precisely the sigmoid function!
  2. Expected Value Prediction: The goal is to predict the expected value of y given x, i.e., h(x) = E[y | x]. For a Bernoulli distribution, E[y | x; \theta] = \phi.

  3. Linear Natural Parameter: The natural parameter \eta is linearly related to the inputs x:
    \eta = \theta^T x

  • The Result: Combining these assumptions, the Logistic Regression hypothesis naturally emerges:
    h_\theta(x) = E[y | x; \theta] = \phi = 1 / (1 + e^{-\eta}) = 1 / (1 + e^{-\theta^T x})
    • This shows why the logistic function is a direct "consequence of the definition of GLMs and exponential family distributions" when y is assumed to be Bernoulli. A quick numeric check of the link/inverse-link relationship follows below.
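Here's a tiny numeric check that the Bernoulli's natural parameter (the logit) and the sigmoid really are inverses of each other; the function names are my own.

```python
import numpy as np

def sigmoid(eta):
    """Inverse link: phi = 1 / (1 + e^{-eta})."""
    return 1.0 / (1.0 + np.exp(-eta))

def logit(phi):
    """Natural parameter of the Bernoulli: eta = log(phi / (1 - phi))."""
    return np.log(phi / (1.0 - phi))

eta = np.linspace(-5.0, 5.0, 11)
phi = sigmoid(eta)
print(np.allclose(logit(phi), eta))  # True: logit and sigmoid undo each other
```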

A Note on Perceptron Algorithm (Historical Context) 🕰️

Briefly, the Perceptron algorithm is a historical precursor to Logistic Regression.

  • It uses a "threshold function" (outputting exactly 0 or 1) instead of the smooth sigmoid (see the tiny sketch after this list).
  • Its update rule also looks identical: \theta_j := \theta_j + \alpha (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)}.
  • Key Difference: Unlike Logistic Regression, Perceptron's predictions are hard to interpret probabilistically, and it cannot be derived as a maximum likelihood estimation algorithm.
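For contrast, here's a tiny sketch of the Perceptron's hard-threshold hypothesis (the helper name and numbers are mine); note the output is a bare 0/1 decision, not a probability.

```python
import numpy as np

def perceptron_predict(theta, x):
    """Hard threshold: outputs exactly 0 or 1, with no probabilistic meaning."""
    return 1 if np.dot(theta, x) >= 0 else 0

theta = np.array([0.5, -1.0])
x = np.array([1.0, 0.3])             # x[0] = 1 as the intercept term
print(perceptron_predict(theta, x))  # 1, since theta^T x = 0.2 >= 0
```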

Wrapping Up 🚀

Logistic Regression is a cornerstone for classification problems in supervised learning. It gracefully handles binary outputs by modeling probabilities with the sigmoid function. Its solid foundation comes from probabilistic assumptions (specifically, the Bernoulli distribution) and its derivation as a maximum likelihood estimator within the elegant framework of Generalized Linear Models.

Understanding these underlying principles is invaluable, whether you're building cold email tools like TailorMails.dev or any other AI-powered application. It empowers you to choose the right models and truly understand their behavior.

If this deep dive was helpful or sparked some new ideas, consider supporting my work! You can grab me a virtual coffee here: https://buymeacoffee.com/randhirbuilds. Your support helps me keep learning, building, and sharing! 💪

