5 Naive Bayes Mistakes That Break Small Medical Datasets
My Naive Bayes classifier predicted "no flu" for every single patient, even those with textbook symptoms. The dataset had only 200 records, and I made five mistakes that are invisible on large datasets but catastrophic on small ones.
Mistake #1: Forgetting Laplace Smoothing
The killer bug was a probability of exactly zero. One symptom combination never appeared in the training data, so P(symptoms|flu) = 0. In Naive Bayes, when any probability is zero, the entire prediction becomes zero — no matter how strong the other evidence is.
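You can see the mechanism with plain arithmetic, no library needed. This is a toy sketch with made-up likelihood values; the scikit-learn example that follows reproduces the same failure end-to-end:

```python
# Naive Bayes multiplies per-feature likelihoods together, so a single
# zero term wipes out every other piece of evidence (hypothetical numbers).
likelihoods = [0.9, 0.8, 0.0]  # third symptom never co-occurred with "flu"
prior_flu = 0.5

score = prior_flu
for p in likelihoods:
    score *= p

print(score)  # 0.0 -- the strong evidence from the first two features is lost
```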
```python
from sklearn.naive_bayes import MultinomialNB
import numpy as np

# Tiny medical dataset: [fever, cough, fatigue]
X_train = np.array([
    [1, 1, 0],  # Patient 1: flu
    [1, 0, 1],  # Patient 2: flu
    [0, 1, 1],  # Patient 3: no flu
    [0, 0, 1],  # Patient 4: no flu
])
y_train = np.array([1, 1, 0, 0])  # 1 = flu, 0 = no flu

# Test case: fever + cough + fatigue (never seen in training!)
X_test = np.array([[1, 1, 1]])

# WITHOUT Laplace smoothing (alpha=0) - BREAKS
# (newer scikit-learn versions may warn about alpha=0; see force_alpha)
model_broken = MultinomialNB(alpha=0)
model_broken.fit(X_train, y_train)
print(f"Broken prediction: {model_broken.predict(X_test)}")  # Likely wrong

# WITH Laplace smoothing (alpha=1) - WORKS
model_fixed = MultinomialNB(alpha=1.0)
model_fixed.fit(X_train, y_train)
print(f"Fixed prediction: {model_fixed.predict(X_test)}")
```
Laplace smoothing adds a small count (usually 1) to every feature-class combination, preventing zero probabilities. On large datasets, this barely changes the numbers. On small datasets, it's the difference between a working model and garbage.
In my exploration of Naive Bayes fundamentals and Laplace smoothing, I found that alpha=1.0 is the standard default, but for medical datasets with extreme class imbalance, alpha=0.5 or even alpha=0.1 can work better.
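Under the hood, MultinomialNB's smoothed per-feature estimate is (count + alpha) / (class total + alpha × n_features). A toy sketch of that formula with hypothetical counts, to show why alpha rescues the never-seen case:

```python
def smoothed_likelihood(feature_count, class_total, n_features, alpha=1.0):
    """Laplace/Lidstone estimate as used by MultinomialNB:
    (count + alpha) / (class total + alpha * n_features)."""
    return (feature_count + alpha) / (class_total + alpha * n_features)

# A feature never observed with the class: the raw estimate is 0/10 = 0.0
print(smoothed_likelihood(0, 10, 3, alpha=0.0))  # 0.0 -> poisons the product
print(smoothed_likelihood(0, 10, 3, alpha=1.0))  # ~0.077, small but nonzero
```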
Mistake #2: Ignoring Class Imbalance
My dataset had 180 "no flu" cases and only 20 "flu" cases. Naive Bayes uses prior probabilities: P(flu) = 20/200 = 0.1. Even with strong symptoms, the model defaults to "no flu" because it's 9 times more common.
```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Imbalanced dataset
X = np.random.randn(200, 3)
y = np.array([1]*20 + [0]*180)  # 10% flu, 90% no flu

model = GaussianNB()
model.fit(X, y)

# Check learned priors (classes_ is sorted: index 0 = no flu, index 1 = flu)
print(f"P(flu) = {model.class_prior_[1]:.3f}")     # ~0.1
print(f"P(no flu) = {model.class_prior_[0]:.3f}")  # ~0.9
```
The fix: pass realistic priors when you construct the model (GaussianNB's `priors` parameter), and judge it with imbalance-aware metrics like balanced accuracy or F1 instead of raw accuracy. If you know the real-world distribution differs from your training data:

```python
# If you know the real-world flu rate is 30%, not 10%
model = GaussianNB(priors=[0.7, 0.3])  # sorted class order: [no flu, flu]
model.fit(X, y)
```
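Plain accuracy hides this failure mode entirely: a model that always answers "no flu" scores 90% on the imbalanced data above. A quick sketch with scikit-learn's `balanced_accuracy_score` (synthetic labels, same 20/180 split):

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([1] * 20 + [0] * 180)  # 10% flu
y_pred = np.zeros(200, dtype=int)        # always predicts "no flu"

print(accuracy_score(y_true, y_pred))           # 0.9 -- looks great
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -- chance level, exposed
```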
Mistake #3: Using Gaussian Naive Bayes on Categorical Data
I had binary features (yes/no symptoms) but used GaussianNB, which assumes continuous, normally distributed data. This is like using a ruler to measure temperature — technically possible, but wrong.
| Feature Type | Correct Naive Bayes Variant |
|---|---|
| Binary (yes/no) | BernoulliNB |
| Count data (word frequencies) | MultinomialNB |
| Continuous (temperature, age) | GaussianNB |
| Mixed types | Preprocess or use a different algorithm |
```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Binary symptom data
X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
y = np.array([1, 0, 1])

# WRONG: GaussianNB on binary data
model_wrong = GaussianNB()
model_wrong.fit(X, y)

# RIGHT: BernoulliNB for binary features
model_right = BernoulliNB()
model_right.fit(X, y)
```
Mistake #4: Not Checking Feature Independence
Naive Bayes assumes features are independent given the class. In medical data, this is often violated — fever and fatigue are correlated. On large datasets, the model is robust to moderate violations. On small datasets, correlated features get double-counted.
I had "fever" and "high temperature" as separate features. They're the same thing! The model treated them as independent evidence, artificially inflating the probability.
Quick independence check:
```python
import pandas as pd

df = pd.DataFrame(X_train, columns=['fever', 'cough', 'fatigue'])
correlation_matrix = df.corr()
print(correlation_matrix)
# If any off-diagonal correlation > 0.7, consider removing one of the pair
```
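Once the matrix flags a pair, you can automate the pruning. This sketch uses one common recipe (upper-triangle scan, drop the later column) on synthetic data where `high_temp` duplicates `fever`; the column names are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
fever = rng.integers(0, 2, size=100)
df = pd.DataFrame({
    "fever": fever,
    "high_temp": fever,                    # duplicate of fever in disguise
    "cough": rng.integers(0, 2, size=100),
})

# Keep only the upper triangle so each pair is checked once,
# then drop one feature from every pair above the threshold
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
print(to_drop)  # expected: ['high_temp'] -- keep 'fever', drop its duplicate
```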
Mistake #5: Splitting Tiny Datasets Randomly
With only 200 samples, a random 80/20 split can easily put all 20 flu cases in the training set, leaving zero in the test set. Or, only marginally better, leave just 2 in the test set — far too few to measure performance.
```python
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.naive_bayes import BernoulliNB

# WRONG: Random split on a tiny dataset (no stratification)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# RIGHT: Stratified K-Fold ensures each fold has both classes
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    model = BernoulliNB(alpha=1.0)
    model.fit(X_train, y_train)
    print(f"Fold accuracy: {model.score(X_test, y_test):.3f}")
```
Stratified K-Fold guarantees that each fold maintains the class distribution. With 5 folds, each test set gets exactly 4 flu cases and 36 no-flu cases.
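The loop above can also be condensed with `cross_val_score`, which accepts the same `StratifiedKFold` object as `cv`. A sketch on synthetic stand-in data (features here are random, so scores land near chance — `balanced_accuracy` is one reasonable scoring choice for the imbalanced case):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(200, 3)).astype(float)  # stand-in symptom data
y = np.array([1] * 20 + [0] * 180)                   # 20 flu, 180 no flu

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(BernoulliNB(alpha=1.0), X, y, cv=skf,
                         scoring="balanced_accuracy")
print(scores.mean())  # one averaged number instead of five printouts
```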
Key Takeaways for Developers
- Always use Laplace smoothing (`alpha=1.0` or higher) on small datasets to prevent zero probabilities
- Match the Naive Bayes variant to your data type: `BernoulliNB` for binary, `MultinomialNB` for counts, `GaussianNB` for continuous
- Check class balance and adjust priors if your training distribution doesn't match production
- Remove highly correlated features (correlation > 0.7) to avoid double-counting evidence
- Use `StratifiedKFold` instead of random splits when you have fewer than 1,000 samples
These five mistakes are silent on large datasets but deadly on small ones. If you want to see how Laplace smoothing and prior probabilities interact with real medical data, check out the interactive flu diagnosis simulator — it shows exactly how each parameter affects predictions.
For more on Naive Bayes assumptions and when they break, see the scikit-learn Naive Bayes guide and this classic paper on feature independence.