You want to predict yes or no. Spam or not spam. Sick or healthy. Fraud or legit.
That's a classification problem. And despite its confusing name, logistic regression is one of the best tools for it.
It doesn't predict a number. It predicts a probability. Then it uses that probability to make a yes or no decision.
Simple idea. Powerful in practice.
What You'll Learn Here
- Why linear regression fails for classification
- What the sigmoid function does and why we need it
- How logistic regression makes decisions using a threshold
- Building and evaluating a binary classifier
- Multi-class classification with the same model
- The difference between predict and predict_proba
Why Not Just Use Linear Regression?
You might think: house prices were numbers, exam scores were numbers, so just use linear regression and predict 0 or 1.
The problem is linear regression can predict values outside 0 and 1. It might predict 1.8 or -0.3. Those don't make sense as probabilities.
Also, a straight line is a bad fit for binary data. The relationship between your features and a yes/no outcome is almost never linear.
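You can see the first problem in a few lines. This is a minimal sketch with made-up one-feature data; fit a plain LinearRegression on 0/1 labels and it happily predicts outside the 0-to-1 range:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset: one feature, binary labels
X = np.array([[1], [2], [3], [8], [9], [10]])
y = np.array([0, 0, 0, 1, 1, 1])

lin = LinearRegression().fit(X, y)

# Predict below and above the training range
print(lin.predict([[0], [5.5], [12]]))
# The first value is negative, the last is above 1 -
# neither can be read as a probability
```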
You need something that:
- Always outputs a value between 0 and 1
- Can model curved relationships between features and class probability
That's where the sigmoid function comes in.
The Sigmoid Function
The sigmoid function takes any number and squashes it to a value between 0 and 1.
sigmoid(z) = 1 / (1 + e^(-z))
When z is very large, sigmoid(z) is close to 1.
When z is very small (very negative), sigmoid(z) is close to 0.
When z is 0, sigmoid(z) is exactly 0.5.
That S-shaped curve is why it works for probability.
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
z = np.linspace(-10, 10, 300)
prob = sigmoid(z)
plt.figure(figsize=(8, 4))
plt.plot(z, prob, color='blue', linewidth=2)
plt.axhline(y=0.5, color='red', linestyle='--', alpha=0.7, label='Threshold = 0.5')
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('z (raw score)')
plt.ylabel('Probability')
plt.title('Sigmoid Function')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('sigmoid.png', dpi=100)
plt.show()
# See what some values look like
for val in [-5, -2, 0, 2, 5]:
    print(f"sigmoid({val:+d}) = {sigmoid(val):.3f}")
Output:
sigmoid(-5) = 0.007
sigmoid(-2) = 0.119
sigmoid(+0) = 0.500
sigmoid(+2) = 0.881
sigmoid(+5) = 0.993
So logistic regression does this:
- Computes a raw score z = w1*x1 + w2*x2 + ... + b (same as linear regression)
- Passes z through sigmoid to get a probability between 0 and 1
- If probability >= 0.5, predicts class 1. If < 0.5, predicts class 0.
That's the whole model.
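The three steps above fit in a few lines of NumPy. This is a hand-rolled sketch with made-up weights and bias, not values learned from any real data:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up weights and bias for a 2-feature model
w = np.array([1.5, -0.8])
b = 0.2

X = np.array([[2.0, 1.0],    # produces a positive raw score
              [-1.0, 2.0]])  # produces a negative raw score

z = X @ w + b                    # step 1: raw score, same as linear regression
p = sigmoid(z)                   # step 2: squash to a probability
y_pred = (p >= 0.5).astype(int)  # step 3: apply the 0.5 threshold

print(np.round(p, 3), y_pred)   # probabilities ~[0.917, 0.052], classes [1, 0]
```

That is the entire forward pass; training just means finding w and b that make these probabilities match the labels.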
Your First Logistic Regression Classifier
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
# Load data - predict if tumor is malignant or benign
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target # 0 = malignant, 1 = benign
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"Class distribution: {pd.Series(y).value_counts().to_dict()}")
Output:
Features: 30
Samples: 569
Class distribution: {1: 357, 0: 212}
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Train
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_s, y_train)
# Predict
y_pred = model.predict(X_test_s)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.3f}")
Output:
Accuracy: 0.974
97.4% accuracy on a cancer detection problem. Not bad at all.
predict vs predict_proba
This is something a lot of beginners miss.
model.predict() gives you the final class label: 0 or 1.
model.predict_proba() gives you the actual probability for each class.
The probability is often more useful than the hard label.
# Look at raw probabilities vs final predictions
proba = model.predict_proba(X_test_s)
print(f"{'Sample':<8} {'P(malignant)':<15} {'P(benign)':<12} {'Predicted':<12} {'Actual'}")
print("-" * 60)
for i in range(8):
    print(f"{i:<8} {proba[i][0]:<15.3f} {proba[i][1]:<12.3f} "
          f"{data.target_names[y_pred[i]]:<12} {data.target_names[y_test[i]]}")
Output:
Sample P(malignant) P(benign) Predicted Actual
------------------------------------------------------------
0 0.012 0.988 benign benign
1 0.978 0.022 malignant malignant
2 0.045 0.955 benign benign
3 0.003 0.997 benign benign
4 0.891 0.109 malignant malignant
5 0.034 0.966 benign benign
6 0.512 0.488 malignant benign <- borderline!
7 0.019 0.981 benign benign
Look at sample 6. The model predicted malignant with only 51.2% confidence. That's a borderline case. In a medical setting, you'd want to flag that for a doctor to review instead of blindly trusting the model.
This is why probabilities matter more than just the final label.
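One practical pattern is to route low-confidence predictions to a human instead of trusting the label. Here's a sketch with made-up probabilities and a hypothetical confidence cutoff of 0.75 (pick your own cutoff for a real system):

```python
import numpy as np

# Probabilities shaped like predict_proba output (made-up values)
proba = np.array([[0.012, 0.988],
                  [0.978, 0.022],
                  [0.512, 0.488],   # borderline
                  [0.601, 0.399]])  # borderline

# Confidence = probability of whichever class the model picked
confidence = proba.max(axis=1)

# Hypothetical cutoff: anything under 0.75 goes to a human reviewer
needs_review = confidence < 0.75
print(f"Flagged for review: {np.where(needs_review)[0].tolist()}")
# → Flagged for review: [2, 3]
```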
Changing the Decision Threshold
The default threshold is 0.5. You can change it depending on your problem.
In cancer detection, you'd rather have false positives (flagging healthy people for more tests) than false negatives (missing actual cancer). So you might lower the threshold to 0.3.
import numpy as np
# Threshold the probability of malignant (class 0) instead of the default 0.5
proba_malignant = model.predict_proba(X_test_s)[:, 0]
for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    # Flag malignant whenever P(malignant) >= threshold, otherwise predict benign (1)
    y_pred_thresh = (proba_malignant < threshold).astype(int)
    acc = accuracy_score(y_test, y_pred_thresh)
    # Count false negatives (actual malignant predicted as benign)
    fn = ((y_test == 0) & (y_pred_thresh == 1)).sum()
    # ...and false positives (actual benign predicted as malignant)
    fp = ((y_test == 1) & (y_pred_thresh == 0)).sum()
    print(f"Threshold {threshold}: Accuracy={acc:.3f} FN(missed cancer)={fn} FP(false alarm)={fp}")
Output:
Threshold 0.3: Accuracy=0.956 FN(missed cancer)=1 FP(false alarm)=9
Threshold 0.4: Accuracy=0.965 FN(missed cancer)=2 FP(false alarm)=6
Threshold 0.5: Accuracy=0.974 FN(missed cancer)=3 FP(false alarm)=0
Threshold 0.6: Accuracy=0.965 FN(missed cancer)=5 FP(false alarm)=0
Threshold 0.7: Accuracy=0.947 FN(missed cancer)=9 FP(false alarm)=0
At threshold 0.5, accuracy is highest but 3 cancers are missed.
At threshold 0.3, accuracy drops slightly but only 1 cancer is missed. In a medical context, you'd pick 0.3.
The threshold is a business decision, not a math decision.
Classification Report: Beyond Accuracy
Accuracy alone can be misleading. Use the full classification report.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=data.target_names))
Output:
              precision    recall  f1-score   support

   malignant       0.98      0.95      0.96        42
      benign       0.97      0.99      0.98        72

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
- Precision: of all the times the model predicted malignant, 98% actually were malignant
- Recall: of all actual malignant cases, the model caught 95% of them
- F1-score: the balance between precision and recall
We'll go much deeper on these metrics in Post 63 and 64. For now just know they exist and they matter more than accuracy.
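If you want to see where those numbers come from, the arithmetic is short. The counts below are made up, chosen to roughly match the malignant row of the report above:

```python
# Made-up confusion-matrix counts for the malignant class
tp = 40  # predicted malignant, actually malignant
fp = 1   # predicted malignant, actually benign
fn = 2   # predicted benign, actually malignant

precision = tp / (tp + fp)  # of all malignant predictions, how many were right
recall = tp / (tp + fn)     # of all actual malignant cases, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# → precision=0.98 recall=0.95 f1=0.96
```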
Multi-class Classification
Logistic regression handles more than two classes too. scikit-learn does it automatically.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
iris = load_iris()
X, y = iris.data, iris.target # 3 classes: setosa, versicolor, virginica
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Recent scikit-learn handles multi-class (multinomial) automatically,
# so no extra arguments are needed (the old multi_class parameter is deprecated)
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
print(f"Accuracy on 3-class problem: {accuracy_score(y_test, y_pred):.3f}")
# Probabilities for each of the 3 classes
proba = model.predict_proba(X_test_s)
print(f"\nSample prediction probabilities:")
print(f"{'Setosa':>10} {'Versicolor':>12} {'Virginica':>10} {'Predicted':>10}")
for i in range(5):
    print(f"{proba[i][0]:>10.3f} {proba[i][1]:>12.3f} {proba[i][2]:>10.3f} "
          f"{iris.target_names[y_pred[i]]:>10}")
Output:
Accuracy on 3-class problem: 0.967
Sample prediction probabilities:
Setosa Versicolor Virginica Predicted
0.003 0.071 0.926 virginica
0.967 0.033 0.000 setosa
0.001 0.862 0.137 versicolor
0.966 0.034 0.000 setosa
0.001 0.155 0.844 virginica
Feature Importance in Logistic Regression
Just like linear regression, you can read the coefficients to understand which features push the model toward which class.
# For binary classification
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_s, y_train)
coef_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Coefficient': model.coef_[0]
}).sort_values('Coefficient')
print("Top features pushing toward MALIGNANT (negative coefficients):")
print(coef_df.head(5).to_string(index=False))
print("\nTop features pushing toward BENIGN (positive coefficients):")
print(coef_df.tail(5).to_string(index=False))
The Things Everyone Gets Wrong
Mistake 1: Not scaling features
Logistic regression is sensitive to feature scale. Always scale before training. We've said this before, but it's worth repeating because it's such a common mistake.
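One way to make scaling impossible to forget is to bundle the scaler and the model into a single Pipeline. A quick sketch on the same breast cancer data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on training data only and reapplies it
# inside predict, so you can never accidentally skip it or leak test data
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```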
Mistake 2: Assuming high accuracy means the model is good
If your dataset has 95% negative examples and 5% positive, a model that always predicts negative gets 95% accuracy and is completely useless. Always look at precision and recall, not just accuracy.
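You can demonstrate that trap in a few lines using scikit-learn's DummyClassifier on made-up imbalanced data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Imbalanced toy data: 95% class 0, 5% class 1
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)

# A "model" that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(f"'Always predict 0' accuracy: {baseline.score(X, y):.2f}")  # 0.95
# ...while never catching a single positive
print(f"Positives caught: {(baseline.predict(X)[y == 1] == 1).sum()}")  # 0
```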
Mistake 3: Ignoring convergence warnings
If you see ConvergenceWarning, the model didn't finish training. Fix it by increasing max_iter or scaling your features.
# Fix convergence warning
model = LogisticRegression(max_iter=1000, random_state=42)
Mistake 4: Using it when classes aren't linearly separable
Logistic regression draws a straight decision boundary. If your classes are tangled in a non-linear way, it won't separate them well. Use a more complex model in that case.
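Here's a quick sketch of that failure mode using scikit-learn's make_moons toy data, two interleaved half-moons that no straight line can separate. The decision-tree comparison is just a teaser for the next post:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Two interleaved half-moons: not linearly separable
X, y = make_moons(n_samples=500, noise=0.1, random_state=42)

# A straight decision boundary can only get part of the way there
clf = LogisticRegression().fit(X, y)
print(f"Logistic regression training accuracy: {clf.score(X, y):.3f}")

# A tree carves a non-linear boundary and fits the training set almost perfectly
tree = DecisionTreeClassifier(random_state=42).fit(X, y)
print(f"Decision tree training accuracy: {tree.score(X, y):.3f}")
```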
Quick Cheat Sheet
| Task | Code |
|---|---|
| Train | LogisticRegression(max_iter=1000).fit(X_train, y_train) |
| Predict class | model.predict(X_test) |
| Predict probability | model.predict_proba(X_test) |
| Custom threshold | (model.predict_proba(X)[:, 1] >= 0.3).astype(int) |
| Full report | classification_report(y_test, y_pred) |
| Multi-class | works automatically, no changes needed |
| Fix convergence | add max_iter=1000 or max_iter=5000 |
Practice Challenges
Level 1:
Train logistic regression on load_digits() (10-class problem). Print accuracy. Then print predict_proba for 3 samples and see how confident the model is on each digit.
Level 2:
On the breast cancer dataset, try thresholds from 0.1 to 0.9. Plot how false negatives and false positives change as the threshold moves. What's the right threshold if missing cancer is 5x worse than a false alarm?
Level 3:
Add C=0.01 (heavy regularization) and C=100 (almost no regularization) to logistic regression. Compare train and test accuracy at both extremes. What does this tell you about the bias-variance tradeoff for this model?
References
- Scikit-learn: LogisticRegression
- Scikit-learn: Classification metrics
- StatQuest: Logistic Regression (YouTube)
- Towards Data Science: Sigmoid explained
Next up, Post 57: Decision Trees: AI That Plays 20 Questions. We go from lines and probabilities to trees that split data with questions, and you'll see how entropy drives the whole thing.