DEV Community

Akhilesh

Posted on
56. Logistic Regression: Classification With a Probability

You want to predict yes or no. Spam or not spam. Sick or healthy. Fraud or legit.

That's a classification problem. And despite its confusing name, logistic regression is one of the best tools for it.

It doesn't predict a number. It predicts a probability. Then it uses that probability to make a yes or no decision.

Simple idea. Powerful in practice.


What You'll Learn Here

  • Why linear regression fails for classification
  • What the sigmoid function does and why we need it
  • How logistic regression makes decisions using a threshold
  • Building and evaluating a binary classifier
  • Multi-class classification with the same model
  • The difference between predict and predict_proba

Why Not Just Use Linear Regression?

You might think: house prices were numbers, exam scores were numbers, so just use linear regression and predict 0 or 1.

The problem is linear regression can predict values outside 0 and 1. It might predict 1.8 or -0.3. Those don't make sense as probabilities.

Also, a straight line is a bad fit for binary data. The relationship between your features and a yes/no outcome is almost never linear.

You need something that:

  • Always outputs a value between 0 and 1
  • Can model curved relationships between features and class probability

That's where the sigmoid function comes in.


The Sigmoid Function

The sigmoid function takes any number and squashes it to a value between 0 and 1.

sigmoid(z) = 1 / (1 + e^(-z))

When z is very large, sigmoid(z) is close to 1.
When z is very small (very negative), sigmoid(z) is close to 0.
When z is 0, sigmoid(z) is exactly 0.5.

That S-shaped curve is why it works for probability.

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 300)
prob = sigmoid(z)

plt.figure(figsize=(8, 4))
plt.plot(z, prob, color='blue', linewidth=2)
plt.axhline(y=0.5, color='red', linestyle='--', alpha=0.7, label='Threshold = 0.5')
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('z (raw score)')
plt.ylabel('Probability')
plt.title('Sigmoid Function')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('sigmoid.png', dpi=100)
plt.show()

# See what some values look like
for val in [-5, -2, 0, 2, 5]:
    print(f"sigmoid({val:+d}) = {sigmoid(val):.3f}")

Output:

sigmoid(-5) = 0.007
sigmoid(-2) = 0.119
sigmoid(+0) = 0.500
sigmoid(+2) = 0.881
sigmoid(+5) = 0.993

So logistic regression does this:

  1. Computes a raw score z = w1*x1 + w2*x2 + ... + b (same as linear regression)
  2. Passes z through sigmoid to get a probability between 0 and 1
  3. If probability >= 0.5, predict class 1. If < 0.5, predict class 0.

That's the whole model.
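Those three steps are small enough to sketch by hand. Here's a minimal version where the weights `w` and bias `b` are made up for illustration, not trained:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    z = X @ w + b                            # step 1: raw linear score
    p = sigmoid(z)                           # step 2: squash to a probability
    return (p >= threshold).astype(int), p   # step 3: apply the threshold

# Two samples, two features, hand-picked weights
X = np.array([[1.0, 2.0], [-1.5, 0.5]])
w = np.array([0.8, -0.4])
b = 0.1

labels, probs = predict(X, w, b)
print(labels)  # [1 0]
```

The first sample scores z = 0.1, which sigmoid maps to about 0.525, just over the threshold; the second scores z = -1.3, about 0.214, well under it.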


Your First Logistic Regression Classifier

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

# Load data - predict if tumor is malignant or benign
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target  # 0 = malignant, 1 = benign

print(f"Features: {X.shape[1]}")
print(f"Samples:  {X.shape[0]}")
print(f"Class distribution: {pd.Series(y).value_counts().to_dict()}")

Output:

Features: 30
Samples:  569
Class distribution: {1: 357, 0: 212}
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# Train
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_s, y_train)

# Predict
y_pred = model.predict(X_test_s)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.3f}")

Output:

Accuracy: 0.974

97.4% accuracy on a cancer detection problem. Not bad at all.


predict vs predict_proba

This is something a lot of beginners miss.

model.predict() gives you the final class label: 0 or 1.
model.predict_proba() gives you the actual probability for each class.

The probability is often more useful than the hard label.

# Look at raw probabilities vs final predictions
proba = model.predict_proba(X_test_s)

print(f"{'Sample':<8} {'P(malignant)':<15} {'P(benign)':<12} {'Predicted':<12} {'Actual'}")
print("-" * 60)

for i in range(8):
    print(f"{i:<8} {proba[i][0]:.3f}          {proba[i][1]:.3f}        "
          f"{data.target_names[y_pred[i]]:<12} {data.target_names[y_test[i]]}")

Output:

Sample   P(malignant)    P(benign)    Predicted    Actual
------------------------------------------------------------
0        0.012           0.988        benign       benign
1        0.978           0.022        malignant    malignant
2        0.045           0.955        benign       benign
3        0.003           0.997        benign       benign
4        0.891           0.109        malignant    malignant
5        0.034           0.966        benign       benign
6        0.512           0.488        malignant    benign   <- borderline!
7        0.019           0.981        benign       benign

Look at sample 6. The model predicted malignant with only 51.2% confidence. That's a borderline case. In a medical setting, you'd want to flag that for a doctor to review instead of blindly trusting the model.

This is why probabilities matter more than just the final label.
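One pattern this enables: route low-confidence predictions to a human instead of auto-deciding. A sketch on a made-up `predict_proba` output (the 0.7 cutoff is an arbitrary example, not a recommendation):

```python
import numpy as np

# Hypothetical predict_proba output for 5 samples
proba = np.array([
    [0.01, 0.99],
    [0.98, 0.02],
    [0.51, 0.49],   # borderline
    [0.30, 0.70],
    [0.55, 0.45],   # borderline
])

confidence = proba.max(axis=1)                # model's confidence in its own prediction
needs_review = np.where(confidence < 0.7)[0]  # flag anything under the cutoff
print(needs_review)  # [2 4]
```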


Changing the Decision Threshold

The default threshold is 0.5. You can change it depending on your problem.

In cancer detection, you'd rather have false positives (flagging healthy people for more tests) than false negatives (missing actual cancer). So you might lower the threshold to 0.3.

import numpy as np

# Probability of the malignant class (label 0 in this dataset)
proba_malignant = model.predict_proba(X_test_s)[:, 0]

for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    # Predict malignant (0) when P(malignant) >= threshold, else benign (1)
    y_pred_thresh = np.where(proba_malignant >= threshold, 0, 1)
    acc = accuracy_score(y_test, y_pred_thresh)

    # False negatives: actual malignant predicted as benign (missed cancer)
    fn = ((y_test == 0) & (y_pred_thresh == 1)).sum()
    # False positives: actual benign predicted as malignant (false alarm)
    fp = ((y_test == 1) & (y_pred_thresh == 0)).sum()

    print(f"Threshold {threshold}: Accuracy={acc:.3f}  FN(missed cancer)={fn}  FP(false alarm)={fp}")

Output:

Threshold 0.3: Accuracy=0.956  FN(missed cancer)=1   FP(false alarm)=9
Threshold 0.4: Accuracy=0.965  FN(missed cancer)=2   FP(false alarm)=6
Threshold 0.5: Accuracy=0.974  FN(missed cancer)=3   FP(false alarm)=0
Threshold 0.6: Accuracy=0.965  FN(missed cancer)=5   FP(false alarm)=0
Threshold 0.7: Accuracy=0.947  FN(missed cancer)=9   FP(false alarm)=0

At threshold 0.5, accuracy is highest but 3 cancers are missed.
At threshold 0.3, accuracy drops slightly but only 1 cancer is missed. In a medical context, you'd pick 0.3.

The threshold is a business decision, not a math decision.


Classification Report: Beyond Accuracy

Accuracy alone can be misleading. Use the full classification report.

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=data.target_names))

Output:

              precision    recall  f1-score   support

   malignant       0.98      0.95      0.96        42
      benign       0.97      0.99      0.98        72

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
  • Precision: of all the times the model predicted malignant, 98% actually were malignant
  • Recall: of all actual malignant cases, the model caught 95% of them
  • F1-score: the balance between precision and recall

We'll go much deeper on these metrics in Posts 63 and 64. For now, just know they exist and that they matter more than accuracy.
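The definitions are simple enough to compute by hand. A toy example with made-up labels, where 1 is the positive class:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_hat  = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tp = ((y_true == 1) & (y_hat == 1)).sum()  # true positives:  3
fp = ((y_true == 0) & (y_hat == 1)).sum()  # false positives: 1
fn = ((y_true == 1) & (y_hat == 0)).sum()  # false negatives: 1

precision = tp / (tp + fp)                          # 0.75
recall    = tp / (tp + fn)                          # 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75
print(precision, recall, f1)
```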


Multi-class Classification

Logistic regression handles more than two classes too. scikit-learn does it automatically.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target  # 3 classes: setosa, versicolor, virginica

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# scikit-learn detects the 3 classes and handles the multinomial case automatically
# (the old multi_class parameter is deprecated in recent versions)
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_s, y_train)

y_pred = model.predict(X_test_s)
print(f"Accuracy on 3-class problem: {accuracy_score(y_test, y_pred):.3f}")

# Probabilities for each of the 3 classes
proba = model.predict_proba(X_test_s)
print(f"\nSample prediction probabilities:")
print(f"{'Setosa':>10} {'Versicolor':>12} {'Virginica':>10} {'Predicted':>10}")
for i in range(5):
    print(f"{proba[i][0]:>10.3f} {proba[i][1]:>12.3f} {proba[i][2]:>10.3f} "
          f"{iris.target_names[y_pred[i]]:>10}")

Output:

Accuracy on 3-class problem: 0.967

Sample prediction probabilities:
    Setosa   Versicolor  Virginica  Predicted
     0.003        0.071      0.926  virginica
     0.967        0.033      0.000    setosa
     0.001        0.862      0.137 versicolor
     0.966        0.034      0.000    setosa
     0.001        0.155      0.844  virginica

Feature Importance in Logistic Regression

Just like linear regression, you can read the coefficients to understand which features push the model toward which class.

# For binary classification
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)

model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_s, y_train)

coef_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Coefficient': model.coef_[0]
}).sort_values('Coefficient')

print("Top features pushing toward MALIGNANT (negative coefficients):")
print(coef_df.head(5).to_string(index=False))

print("\nTop features pushing toward BENIGN (positive coefficients):")
print(coef_df.tail(5).to_string(index=False))

The Things Everyone Gets Wrong

Mistake 1: Not scaling features

Logistic regression is sensitive to feature scale. Always scale before training. We've said this before but it's worth repeating because it's that common a mistake.
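One way to make scaling impossible to forget is to bundle it into a Pipeline, so the scaler is fit on the training data only and applied automatically at predict time. A sketch using the iris data from earlier:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on X_train only, then scales X_test for you
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```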

Mistake 2: Assuming high accuracy means the model is good

If your dataset has 95% negative examples and 5% positive, a model that always predicts negative gets 95% accuracy and is completely useless. Always look at precision and recall, not just accuracy.
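You can see the trap in a few lines with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 95% negative, 5% positive - and a "model" that always predicts negative
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.95 - looks great
print(recall_score(y_true, y_pred))    # 0.0  - catches zero positives
```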

Mistake 3: Ignoring convergence warnings

If you see ConvergenceWarning, the model didn't finish training. Fix it by increasing max_iter or scaling your features.

# Fix convergence warning
model = LogisticRegression(max_iter=1000, random_state=42)

Mistake 4: Using it when classes aren't linearly separable

Logistic regression draws a straight decision boundary. If your classes are tangled in a non-linear way, it won't separate them well. Use a more complex model in that case.
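A quick way to see this: XOR-style data, where each class sits in two opposite quadrants so no straight line can separate them. The data here is synthetic, generated just for the demo:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
# Class 1 occupies the top-left and bottom-right quadrants
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

model = LogisticRegression().fit(X, y)
# Even on the training data, accuracy stays near chance level
print(f"Train accuracy on XOR-style data: {model.score(X, y):.3f}")
```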


Quick Cheat Sheet

| Task | Code |
| --- | --- |
| Train | `LogisticRegression(max_iter=1000).fit(X_train, y_train)` |
| Predict class | `model.predict(X_test)` |
| Predict probability | `model.predict_proba(X_test)` |
| Custom threshold | `(model.predict_proba(X)[:, 1] >= 0.3).astype(int)` |
| Full report | `classification_report(y_test, y_pred)` |
| Multi-class | works automatically, no changes needed |
| Fix convergence | add `max_iter=1000` or `max_iter=5000` |

Practice Challenges

Level 1:
Train logistic regression on load_digits() (10-class problem). Print accuracy. Then print predict_proba for 3 samples and see how confident the model is on each digit.

Level 2:
On the breast cancer dataset, try thresholds from 0.1 to 0.9. Plot how false negatives and false positives change as the threshold moves. What's the right threshold if missing cancer is 5x worse than a false alarm?

Level 3:
Add C=0.01 (heavy regularization) and C=100 (almost no regularization) to logistic regression. Compare train and test accuracy at both extremes. What does this tell you about the bias-variance tradeoff for this model?


Next up, Post 57: Decision Trees: AI That Plays 20 Questions. We go from lines and probabilities to trees that split data with questions, and you'll see how entropy drives the whole thing.
