Akhilesh
62. Naive Bayes: Fast, Simple, Surprisingly Effective

Your email spam filter makes a decision in milliseconds. Thousands of words. Instant classification.

Most of the algorithms we've covered so far would struggle with that. KNN needs to compute distances across thousands of features. SVM training slows down as the dataset grows. Even tree-based models take time.

Naive Bayes does it in one pass. It counts words, multiplies probabilities, picks the class with the highest probability. Done.

It's been doing this since the 1990s and it still works.


What You'll Learn Here

  • What Bayes theorem is in plain words, not symbols
  • Why the naive assumption works even when it is wrong
  • The three variants: Gaussian, Multinomial, Bernoulli
  • Building a text classifier from scratch
  • When Naive Bayes wins and when it loses
  • Full working code with scikit-learn

Bayes Theorem in Plain English

You want to know: given that this email contains the word "casino", what is the probability it is spam?

That's a conditional probability. Written as:

P(spam | word="casino")

Bayes theorem says you can calculate this using things you already know from training data:

P(spam | casino) = P(casino | spam) * P(spam)
                   ─────────────────────────────
                          P(casino)

In words:

  • P(casino | spam): how often does the word "casino" appear in spam emails? You know this from training data.
  • P(spam): what fraction of all emails are spam? You know this too.
  • P(casino): how often does "casino" appear in any email? Also known.

So you can calculate the probability that an email is spam, given that it contains "casino", using just counts from your training data.

For classification, you don't even need the denominator P(casino) because it's the same for all classes. You just compare:

P(spam | casino)     vs     P(not spam | casino)

Whichever is bigger wins.
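To make the comparison concrete, here is a tiny sketch. Every number below is made up for illustration; the point is that everything reduces to counting:

```python
# Made-up training counts, purely for illustration
n_spam, n_ham = 40, 60                  # 100 training emails total
casino_in_spam, casino_in_ham = 12, 1   # how many contain "casino"

p_spam = n_spam / (n_spam + n_ham)               # P(spam) = 0.4
p_casino_given_spam = casino_in_spam / n_spam    # P(casino | spam) = 0.30
p_casino_given_ham = casino_in_ham / n_ham       # P(casino | not spam) ~ 0.017

# Compare numerators only: P(casino) is the same for both and cancels out
score_spam = p_casino_given_spam * p_spam        # 0.12
score_ham  = p_casino_given_ham * (1 - p_spam)   # 0.01

print("spam" if score_spam > score_ham else "not spam")  # prints: spam
```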

Why It's Called Naive

Real emails have many words. You need:

P(spam | word1, word2, word3, ..., word1000)

Calculating the joint probability of all those words together is practically impossible. With 1000 words there are 2^1000 possible combinations of present and absent words; no training set could ever cover them.

The naive assumption: treat every word as independent of every other word. Pretend that seeing the word "casino" tells you nothing about whether "free" also appears.

P(spam | word1, word2, ..., wordN)
  ≈ P(word1 | spam) * P(word2 | spam) * ... * P(wordN | spam) * P(spam)

Now you just multiply individual word probabilities. Those you can estimate easily from training data.
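One practical wrinkle: multiplying hundreds of small probabilities underflows to zero in floating point, so real implementations sum log-probabilities instead. A minimal sketch, with invented per-word values:

```python
import math

# Hypothetical per-word likelihoods P(word_i | spam) -- invented values
word_probs = [0.02, 0.05, 0.01, 0.03]
prior = 0.4

# Direct product: fine for 4 words, underflows for long documents
product = prior * math.prod(word_probs)

# Equivalent, numerically stable version: sum the logs
log_score = math.log(prior) + sum(math.log(p) for p in word_probs)

print(math.isclose(math.exp(log_score), product))  # prints: True
```

This is why scikit-learn exposes `feature_log_prob_` rather than raw probabilities.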

Is this assumption true? Absolutely not. Words in emails are not independent at all. "Free money" tends to appear together in spam.

Does it work anyway? Yes. Shockingly well.

The reason it still works is that even wrong independence assumptions lead to the right class comparison most of the time. The relative ordering of class probabilities tends to be preserved even when the absolute probabilities are wrong.


Three Variants of Naive Bayes

Different variants handle different types of features.

Gaussian Naive Bayes
For continuous features. Assumes each feature follows a normal (Gaussian) distribution within each class.

Multinomial Naive Bayes
For count data. Most common for text classification using word counts or TF-IDF.

Bernoulli Naive Bayes
For binary features. Good for text where you only care whether a word appears, not how many times.


Gaussian Naive Bayes on Numeric Data

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)
print(f"Gaussian NB Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print()
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Output:

Gaussian NB Accuracy: 0.967

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.90      0.95        10
   virginica       0.91      1.00      0.95        10

    accuracy                           0.97        30

What the model actually learned: for each feature and each class, it calculated the mean and variance. At prediction time, it checks how likely each feature value is under each class's distribution.

# What GaussianNB learned
import pandas as pd
import numpy as np

print("Class means for each feature:")
means_df = pd.DataFrame(
    gnb.theta_,
    columns=iris.feature_names,
    index=iris.target_names
)
print(means_df.round(2))

Output:

Class means for each feature:
            sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
setosa                   5.01              3.43               1.46              0.25
versicolor               5.94              2.77               4.26              1.33
virginica                6.59              2.97               5.55              2.03

These means tell the whole story. Virginica has the longest petals. Setosa has the shortest. When a new flower comes in, the model checks which class's distribution it fits best.


Multinomial Naive Bayes for Text Classification

This is where Naive Bayes really shines. Let's build a spam classifier.

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
import numpy as np

# Simple spam dataset
emails = [
    # Spam
    ("Get rich quick! Free money! Click here now!", 1),
    ("You won a prize! Claim your free casino chips!", 1),
    ("Cheap meds online! No prescription needed!", 1),
    ("URGENT: Your account needs verification. Click now!", 1),
    ("Make money from home! Easy income guaranteed!", 1),
    ("Free Viagra! Cialis! Lowest prices online!", 1),
    ("Congratulations you have been selected for a prize!", 1),
    ("Win big today! Limited time casino offer!", 1),
    # Not spam
    ("Hey, are we still meeting for lunch tomorrow?", 0),
    ("The quarterly report is ready for your review.", 0),
    ("Can you send me the project files?", 0),
    ("Meeting rescheduled to 3pm on Thursday.", 0),
    ("Your order has been shipped. Track it here.", 0),
    ("Thanks for the presentation today, great work!", 0),
    ("Please review the attached document and let me know.", 0),
    ("Team lunch is on Friday at noon, see you there!", 0),
]

texts, labels = zip(*emails)
texts  = list(texts)
labels = list(labels)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# Convert text to word counts
vectorizer = CountVectorizer(stop_words='english', lowercase=True)
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts  = vectorizer.transform(X_test)

# Train Multinomial NB
mnb = MultinomialNB(alpha=1.0)  # alpha=1 is Laplace smoothing
mnb.fit(X_train_counts, y_train)

y_pred = mnb.predict(X_test_counts)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print()
print(classification_report(y_test, y_pred, target_names=['not spam', 'spam']))

What the Model Learned About Words

This is the most interesting part. You can see exactly which words push toward spam and which toward not-spam.

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Log probabilities for each class
log_probs = mnb.feature_log_prob_  # shape: (n_classes, n_features)

# Top spam words
spam_log_probs   = log_probs[1]
notspam_log_probs = log_probs[0]

# Words most associated with spam
spam_word_scores = pd.DataFrame({
    'Word':      feature_names,
    'Spam prob': np.exp(spam_log_probs),
    'Ham prob':  np.exp(notspam_log_probs),
    'Diff':      spam_log_probs - notspam_log_probs
}).sort_values('Diff', ascending=False)

print("Top words that scream SPAM:")
print(spam_word_scores.head(10)[['Word', 'Diff']].to_string(index=False))

print("\nTop words that scream NOT SPAM:")
print(spam_word_scores.tail(10)[['Word', 'Diff']].to_string(index=False))

Classify New Emails

new_emails = [
    "Free money! You won! Click here!",
    "Can we schedule a call for next week?",
    "Exclusive casino offer just for you, free chips!",
    "The project deadline has been moved to Friday.",
]

new_counts = vectorizer.transform(new_emails)
predictions = mnb.predict(new_counts)
probabilities = mnb.predict_proba(new_counts)

for email, pred, proba in zip(new_emails, predictions, probabilities):
    label = "SPAM" if pred == 1 else "NOT SPAM"
    print(f"[{label}] (confidence: {max(proba):.1%})")
    snippet = email if len(email) <= 60 else email[:60] + "..."
    print(f"  '{snippet}'")
    print()

Output:

[SPAM] (confidence: 99.8%)
  'Free money! You won! Click here!'

[NOT SPAM] (confidence: 94.2%)
  'Can we schedule a call for next week?'

[SPAM] (confidence: 99.1%)
  'Exclusive casino offer just for you, free chips!'

[NOT SPAM] (confidence: 91.7%)
  'The project deadline has been moved to Friday.'

TF-IDF Instead of Raw Counts

Raw word counts give too much weight to common words. TF-IDF (Term Frequency-Inverse Document Frequency) adjusts for this. Words that appear in many documents get lower weight.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Pipeline with TF-IDF
tfidf_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', lowercase=True)),
    ('nb',    MultinomialNB(alpha=0.1))
])

tfidf_pipeline.fit(X_train, y_train)
y_pred_tfidf = tfidf_pipeline.predict(X_test)
print(f"TF-IDF + Naive Bayes Accuracy: {accuracy_score(y_test, y_pred_tfidf):.3f}")

Bernoulli Naive Bayes

When you only care if a word appears at all, not how many times:

from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# BernoulliNB works with binary features (word present or not)
bin_vectorizer = CountVectorizer(binary=True, stop_words='english')
X_train_bin = bin_vectorizer.fit_transform(X_train)
X_test_bin  = bin_vectorizer.transform(X_test)

bnb = BernoulliNB(alpha=1.0)
bnb.fit(X_train_bin, y_train)

y_pred_b = bnb.predict(X_test_bin)
print(f"Bernoulli NB Accuracy: {accuracy_score(y_test, y_pred_b):.3f}")

When to use which:

  • Gaussian NB: continuous numeric features
  • Multinomial NB: word counts, TF-IDF, frequency data
  • Bernoulli NB: binary features, short text, word presence/absence

Laplace Smoothing: Handling Unseen Words

What if a word appears in test data but never appeared in training? Its probability would be 0. And 0 multiplied by anything is 0. The whole prediction collapses.

Laplace smoothing fixes this by adding a small count to every word, even unseen ones.

# alpha controls smoothing
# alpha=1.0 is classic Laplace smoothing
# alpha=0.1 is lighter smoothing - better when you have lots of data

for alpha in [0.01, 0.1, 0.5, 1.0, 2.0, 5.0]:
    mnb_a = MultinomialNB(alpha=alpha)
    mnb_a.fit(X_train_counts, y_train)
    acc = accuracy_score(y_test, mnb_a.predict(X_test_counts))
    print(f"alpha={alpha:<5} accuracy={acc:.3f}")

alpha=1.0 is a safe default. On larger datasets, try smaller values like 0.1.
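The formula behind alpha is simple: add alpha to every word count, and alpha times the vocabulary size to the denominator. A quick sketch (the counts are invented):

```python
def smoothed_prob(word_count, total_words, vocab_size, alpha=1.0):
    """Laplace-smoothed estimate of P(word | class)."""
    return (word_count + alpha) / (total_words + alpha * vocab_size)

# An unseen word (count 0) still gets a small nonzero probability
print(smoothed_prob(0, 500, 1000))       # ~0.000667, not zero
print(smoothed_prob(0, 500, 1000, 0.0))  # 0.0: without smoothing it collapses
```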


Real Dataset: 20 Newsgroups

Let's test on a real text classification dataset with 20 categories.

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import time

# Load 4 categories for speed
categories = ['sci.space', 'rec.sport.hockey', 'talk.politics.guns', 'comp.graphics']

train_data = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers'))
test_data  = fetch_20newsgroups(subset='test',  categories=categories, remove=('headers', 'footers'))

print(f"Training documents: {len(train_data.data)}")
print(f"Testing documents:  {len(test_data.data)}")
print(f"Categories: {categories}")

# Build and train pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('nb',    MultinomialNB(alpha=0.1))
])

start = time.time()
pipeline.fit(train_data.data, train_data.target)
train_time = time.time() - start

start = time.time()
y_pred = pipeline.predict(test_data.data)
predict_time = time.time() - start

acc = accuracy_score(test_data.target, y_pred)
print(f"\nAccuracy:     {acc:.3f}")
print(f"Train time:   {train_time:.3f}s")
print(f"Predict time: {predict_time:.3f}s")

Output:

Training documents: 2169
Testing documents:  1444
Categories: ['sci.space', 'rec.sport.hockey', 'talk.politics.guns', 'comp.graphics']

Accuracy:     0.941
Train time:   0.043s
Predict time: 0.008s

94.1% accuracy. Trained in 0.04 seconds. Predicted 1444 documents in 0.008 seconds.

That speed is the whole point. Neural networks will get higher accuracy on text. But if you need something fast, interpretable, and good enough, Naive Bayes is hard to beat.


When Naive Bayes Wins and When to Skip It

Use Naive Bayes when:

  • Text classification: spam, sentiment, topic classification
  • Dataset is small. NB needs very little data to work well.
  • You need fast training and prediction at scale
  • You want a quick, solid baseline before trying complex models
  • Features are truly or mostly independent (rare but happens)

Skip Naive Bayes when:

  • Features are strongly correlated; the naive assumption causes big problems here.
  • You need very high accuracy and have enough data for complex models.
  • Your numeric features have complex non-linear relationships.
  • You need the probability estimates themselves to be accurate, not just the class ranking.

Comparing All Three Variants

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler  # MultinomialNB needs non-negative input

data = load_breast_cancer()
X_bc, y_bc = data.data, data.target

# MinMaxScaler for MultinomialNB (needs non-negative features)
X_scaled = MinMaxScaler().fit_transform(X_bc)

models = {
    'GaussianNB':    (GaussianNB(),           X_bc),
    'MultinomialNB': (MultinomialNB(alpha=1), X_scaled),
    # BernoulliNB binarizes at 0 by default, so every positive feature becomes 1
    'BernoulliNB':   (BernoulliNB(alpha=1),   X_bc),
}

print(f"{'Model':<18} {'CV Mean':<10} {'CV Std'}")
print("-" * 38)

for name, (model, X_use) in models.items():
    scores = cross_val_score(model, X_use, y_bc, cv=5)
    print(f"{name:<18} {scores.mean():.3f}      {scores.std():.3f}")

Output:

Model              CV Mean    CV Std
--------------------------------------
GaussianNB         0.939      0.020
MultinomialNB      0.898      0.022
BernoulliNB        0.627      0.033

GaussianNB wins on numeric data, as expected. MultinomialNB is mediocre on numeric data but excellent on text. BernoulliNB collapses here because its default binarize=0.0 turns every positive value into 1, throwing away nearly all the information in continuous features.
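As a side experiment (not part of the comparison above), you can give BernoulliNB a more sensible threshold through its binarize parameter. This sketch scales features to [0, 1] and binarizes at the midpoint instead of at zero:

```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
X_scaled = MinMaxScaler().fit_transform(data.data)

# binarize=0.5 thresholds each scaled feature at the midpoint of its range,
# instead of the default binarize=0.0 which maps every positive value to 1
bnb = BernoulliNB(alpha=1.0, binarize=0.5)
scores = cross_val_score(bnb, X_scaled, data.target, cv=5)
print(f"BernoulliNB (binarize=0.5) CV mean: {scores.mean():.3f}")
```

The exact score will vary, but a sensible threshold should recover much of the signal the default throws away.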


Quick Cheat Sheet

  • Numeric features: GaussianNB()
  • Word counts / TF-IDF: MultinomialNB(alpha=1.0)
  • Binary features: BernoulliNB(alpha=1.0)
  • Text vectorization: TfidfVectorizer(stop_words='english')
  • Full text pipeline: Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])
  • Get probabilities: .predict_proba(X)
  • See word probabilities: np.exp(model.feature_log_prob_)
  • Tune smoothing: try alpha values from 0.01 to 5.0

Practice Challenges

Level 1:
Load the 20 Newsgroups dataset with all 20 categories. Train a TF-IDF + MultinomialNB pipeline. Print overall accuracy and the classification report. Which categories does it confuse most?

Level 2:
On the breast cancer dataset, compare GaussianNB to LogisticRegression and KNN. Where does NB fall short? Is the gap large or small?

Level 3:
Build a sentiment classifier. Use any small movie review or product review dataset (the movie_reviews corpus from NLTK works). Compare CountVectorizer vs TfidfVectorizer with MultinomialNB. Which gives better accuracy? Try tuning alpha with cross-validation.


Next up, Post 63: Confusion Matrix: What Your Model Got Wrong and Why. TP, TN, FP, FN explained properly with real examples. The one tool that tells you exactly where your model is failing.
