5 Naive Bayes Mistakes That Break Small Medical Datasets
My Naive Bayes classifier predicted "no flu" for every single patient, even those with textbook symptoms. The dataset had only 200 records, and I made five mistakes that are invisible on large datasets but catastrophic on small ones.
Mistake #1: Forgetting Laplace Smoothing
The killer bug was a probability of exactly zero. One symptom combination never appeared in the training data, so P(symptoms|flu) = 0. In Naive Bayes, when any probability is zero, the entire prediction becomes zero — no matter how strong the other evidence is.
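You can see the mechanism with plain arithmetic, no library needed. This is a toy sketch with made-up likelihood values; the scikit-learn example that follows reproduces the same failure end-to-end:

```python
# Naive Bayes multiplies per-feature likelihoods together, so a single
# zero term wipes out every other piece of evidence (hypothetical numbers).
likelihoods = [0.9, 0.8, 0.0]  # third symptom never co-occurred with "flu"
prior_flu = 0.5

score = prior_flu
for p in likelihoods:
    score *= p

print(score)  # 0.0 -- the strong evidence from the first two features is lost
```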
```python
from sklearn.naive_bayes import MultinomialNB
import numpy as np

# Tiny medical dataset: [fever, cough, fatigue]
X_train = np.array([
    [1, 1, 0],  # Patient 1: flu
    [1, 0, 1],  # Patient 2: flu
    [0, 1, 1],  # Patient 3: no flu
    [0, 0, 1],  # Patient 4: no flu
])
y_train = np.array([1, 1, 0, 0])  # 1 = flu, 0 = no flu

# Test case: fever + cough + fatigue (never seen in training!)
X_test = np.array([[1, 1, 1]])

# WITHOUT Laplace smoothing (alpha=0) - BREAKS
# (newer scikit-learn versions may warn about alpha=0; see force_alpha)
model_broken = MultinomialNB(alpha=0)
model_broken.fit(X_train, y_train)
print(f"Broken prediction: {model_broken.predict(X_test)}")  # Likely wrong

# WITH Laplace smoothing (alpha=1) - WORKS
model_fixed = MultinomialNB(alpha=1.0)
model_fixed.fit(X_train, y_train)
print(f"Fixed prediction: {model_fixed.predict(X_test)}")
```
Laplace smoothing adds a small count (usually 1) to every feature-class combination, preventing zero probabilities. On large datasets, this barely changes the numbers. On small datasets, it's the difference between a working model and garbage.
In my exploration of Naive Bayes fundamentals and Laplace smoothing, I found that alpha=1.0 is the standard default, but for medical datasets with extreme class imbalance, alpha=0.5 or even alpha=0.1 can work better.
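Under the hood, MultinomialNB's smoothed per-feature estimate is (count + alpha) / (class total + alpha × n_features). A toy sketch of that formula with hypothetical counts, to show why alpha rescues the never-seen case:

```python
def smoothed_likelihood(feature_count, class_total, n_features, alpha=1.0):
    """Laplace/Lidstone estimate as used by MultinomialNB:
    (count + alpha) / (class total + alpha * n_features)."""
    return (feature_count + alpha) / (class_total + alpha * n_features)

# A feature never observed with the class: the raw estimate is 0/10 = 0.0
print(smoothed_likelihood(0, 10, 3, alpha=0.0))  # 0.0 -> poisons the product
print(smoothed_likelihood(0, 10, 3, alpha=1.0))  # ~0.077, small but nonzero
```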
Mistake #2: Ignoring Class Imbalance
My dataset had 180 "no flu" cases and only 20 "flu" cases. Naive Bayes uses prior probabilities: P(flu) = 20/200 = 0.1. Even with strong symptoms, the model defaults to "no flu" because it's 9 times more common.
```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Imbalanced dataset
X = np.random.randn(200, 3)
y = np.array([1]*20 + [0]*180)  # 10% flu, 90% no flu

model = GaussianNB()
model.fit(X, y)

# Check learned priors (classes_ is sorted: index 0 = no flu, index 1 = flu)
print(f"P(flu) = {model.class_prior_[1]:.3f}")     # ~0.1
print(f"P(no flu) = {model.class_prior_[0]:.3f}")  # ~0.9
```
The fix: pass realistic priors when you construct the model (GaussianNB's `priors` parameter), and judge it with imbalance-aware metrics like balanced accuracy or F1 instead of raw accuracy. If you know the real-world distribution differs from your training data:

```python
# If you know the real-world flu rate is 30%, not 10%
model = GaussianNB(priors=[0.7, 0.3])  # sorted class order: [no flu, flu]
model.fit(X, y)
```
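Plain accuracy hides this failure mode entirely: a model that always answers "no flu" scores 90% on the imbalanced data above. A quick sketch with scikit-learn's `balanced_accuracy_score` (synthetic labels, same 20/180 split):

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([1] * 20 + [0] * 180)  # 10% flu
y_pred = np.zeros(200, dtype=int)        # always predicts "no flu"

print(accuracy_score(y_true, y_pred))           # 0.9 -- looks great
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -- chance level, exposed
```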
Mistake #3: Using Gaussian Naive Bayes on Categorical Data
I had binary features (yes/no symptoms) but used GaussianNB, which assumes continuous, normally distributed data. This is like using a ruler to measure temperature — technically possible, but wrong.
| Feature Type | Correct Naive Bayes Variant |
|---|---|
| Binary (yes/no) | BernoulliNB |
| Count data (word frequencies) | MultinomialNB |
| Continuous (temperature, age) | GaussianNB |
| Mixed types | Preprocess or use a different algorithm |
```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Binary symptom data
X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
y = np.array([1, 0, 1])

# WRONG: GaussianNB on binary data
model_wrong = GaussianNB()
model_wrong.fit(X, y)

# RIGHT: BernoulliNB for binary features
model_right = BernoulliNB()
model_right.fit(X, y)
```
Mistake #4: Not Checking Feature Independence
Naive Bayes assumes features are independent given the class. In medical data, this is often violated — fever and fatigue are correlated. On large datasets, the model is robust to moderate violations. On small datasets, correlated features get double-counted.
I had "fever" and "high temperature" as separate features. They're the same thing! The model treated them as independent evidence, artificially inflating the probability.
Quick independence check:
```python
import pandas as pd

df = pd.DataFrame(X_train, columns=['fever', 'cough', 'fatigue'])
correlation_matrix = df.corr()
print(correlation_matrix)
# If any off-diagonal correlation > 0.7, consider removing one of the pair
```
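Once the matrix flags a pair, you can automate the pruning. This sketch uses one common recipe (upper-triangle scan, drop the later column) on synthetic data where `high_temp` duplicates `fever`; the column names are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
fever = rng.integers(0, 2, size=100)
df = pd.DataFrame({
    "fever": fever,
    "high_temp": fever,                    # duplicate of fever in disguise
    "cough": rng.integers(0, 2, size=100),
})

# Keep only the upper triangle so each pair is checked once,
# then drop one feature from every pair above the threshold
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
print(to_drop)  # expected: ['high_temp'] -- keep 'fever', drop its duplicate
```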
Mistake #5: Splitting Tiny Datasets Randomly
With only 200 samples, a random 80/20 split can easily put all 20 flu cases in the training set, leaving zero in the test set. Or, only marginally better, leave just 2 in the test set — far too few to measure performance.
```python
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.naive_bayes import BernoulliNB

# WRONG: Random split on a tiny dataset (no stratification)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# RIGHT: Stratified K-Fold ensures each fold has both classes
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    model = BernoulliNB(alpha=1.0)
    model.fit(X_train, y_train)
    print(f"Fold accuracy: {model.score(X_test, y_test):.3f}")
```
Stratified K-Fold guarantees that each fold maintains the class distribution. With 5 folds, each test set gets exactly 4 flu cases and 36 no-flu cases.
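The loop above can also be condensed with `cross_val_score`, which accepts the same `StratifiedKFold` object as `cv`. A sketch on synthetic stand-in data (features here are random, so scores land near chance — `balanced_accuracy` is one reasonable scoring choice for the imbalanced case):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(200, 3)).astype(float)  # stand-in symptom data
y = np.array([1] * 20 + [0] * 180)                   # 20 flu, 180 no flu

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(BernoulliNB(alpha=1.0), X, y, cv=skf,
                         scoring="balanced_accuracy")
print(scores.mean())  # one averaged number instead of five printouts
```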
Key Takeaways for Developers
- Always use Laplace smoothing (`alpha=1.0` or higher) on small datasets to prevent zero probabilities
- Match the Naive Bayes variant to your data type: `BernoulliNB` for binary, `MultinomialNB` for counts, `GaussianNB` for continuous
- Check class balance and adjust priors if your training distribution doesn't match production
- Remove highly correlated features (correlation > 0.7) to avoid double-counting evidence
- Use `StratifiedKFold` instead of random splits when you have fewer than 1,000 samples
These five mistakes are silent on large datasets but deadly on small ones. If you want to see how Laplace smoothing and prior probabilities interact with real medical data, check out the interactive flu diagnosis simulator — it shows exactly how each parameter affects predictions.
For more on Naive Bayes assumptions and when they break, see the scikit-learn Naive Bayes guide and this classic paper on feature independence.