DEV Community

Berkan Sesen

Posted on • Originally published at sesen.ai

Text Classification from Scratch: TF-IDF and Naive Bayes

Every morning, your inbox separates spam from real email. News apps sort articles into sports, tech, and politics. Customer support systems route tickets to the right team. Behind all of these is text classification: teaching a machine to read a document and assign it a category.

The building blocks are simpler than you might expect. You need a way to convert text into numbers (TF-IDF), a classifier that works well with sparse, high-dimensional data (Naive Bayes), and a few lines of code to tie them together. No deep learning, no GPUs, no embeddings.

By the end of this post, you'll classify news articles into 20 categories with 77% accuracy using just 10 lines of Python, then push that to 84% with hyperparameter tuning. You'll understand exactly how TF-IDF works and why the "naive" independence assumption in Naive Bayes is a feature, not a bug.

Let's Build It

An interactive notebook accompanies this post; you can run all the code in Google Colab.

Here's the complete classifier. We use scikit-learn's 20 Newsgroups dataset, which contains around 18,000 posts across 20 topics, from computer graphics to religion to space exploration:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load training and test data
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)

# Build the pipeline: raw text → word counts → TF-IDF → Naive Bayes
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Train and evaluate
text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(twenty_test.data)
print(f'Accuracy: {accuracy_score(twenty_test.target, predicted):.1%}')
# Accuracy: 77.4%

With 10 lines of modelling code, we classify documents into one of 20 categories at 77.4% accuracy on unseen data. Random guessing would give 5%.

Let's test it on fresh sentences the model has never seen:

docs_new = [
    'OpenGL shading techniques for real-time rendering',
    'The Detroit Tigers signed a new pitcher today',
    'NASA launched the James Webb telescope last year',
    'Is there evidence for the existence of God?',
]

predicted_new = text_clf.predict(docs_new)
for doc, category in zip(docs_new, predicted_new):
    print(f'{twenty_train.target_names[category]:>28s}  ←  {doc}')

               comp.graphics  ←  OpenGL shading techniques for real-time rendering
          rec.sport.baseball  ←  The Detroit Tigers signed a new pitcher today
                   sci.space  ←  NASA launched the James Webb telescope last year
      soc.religion.christian  ←  Is there evidence for the existence of God?

The model correctly identifies the topic of each sentence. It works by finding which words are most characteristic of each category.

Confusion matrix heatmap for the Naive Bayes classifier on 20 Newsgroups. The diagonal shows correct predictions; off-diagonal cells reveal common confusions between related topics.

The confusion matrix reveals where the classifier struggles. Related categories like comp.sys.ibm.pc.hardware and comp.sys.mac.hardware (both about computer hardware) are frequently confused, as are talk.religion.misc and soc.religion.christian. These make intuitive sense: documents about Mac hardware and PC hardware use very similar vocabulary.

What Just Happened?

Three components work in sequence: CountVectorizer turns text into word counts, TfidfTransformer re-weights those counts to highlight distinctive words, and MultinomialNB learns which words signal which categories.

The text classification pipeline: raw text flows through tokenisation, word counting, TF-IDF weighting, and finally the Naive Bayes classifier to produce a category prediction.

Step 1: Turning Text into Numbers

A machine learning model can't read English. It needs numbers. The simplest conversion is the bag of words: count how many times each word appears in a document, ignoring order entirely.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'The cat sat on the mat',
    'The dog sat on the log',
    'The cat chased the dog',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['cat', 'chased', 'dog', 'log', 'mat', 'on', 'sat', 'the']

print(X.toarray())
# [[1, 0, 0, 0, 1, 1, 1, 2],
#  [0, 0, 1, 1, 0, 1, 1, 2],
#  [1, 1, 1, 0, 0, 0, 0, 2]]

Each row is a document. Each column is a word from the vocabulary. The value is the word count. Notice that "the" always gets a count of 2, regardless of the document. It's everywhere, so it carries no information about which document you're looking at.

On the 20 Newsgroups training set, CountVectorizer discovers around 130,000 unique tokens. Each document becomes a vector of 130,000 dimensions, mostly zeros (since any single post uses only a tiny fraction of the full vocabulary).

Step 2: Weighting Words That Matter

Not all words are equally informative. Words like "the", "is", and "a" appear in every document. What we want are words that are common within a specific category but rare overall. This is exactly what TF-IDF (Term Frequency, Inverse Document Frequency) captures.

The weight for word $t$ in document $d$ is:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$

Where:

  • TF (term frequency) = how often the word appears in this document
  • IDF (inverse document frequency) = $\log\!\frac{1+N}{1+n_t}+1$, where $N$ is the total number of documents and $n_t$ is the number of documents containing word $t$

A word that appears in every document gets a low IDF, shrinking its weight. A word that appears in only a few documents gets a high IDF, amplifying its signal.
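Plugging the toy corpus from Step 1 into this formula makes the effect concrete. Here is a quick hand computation of IDF for two of its words (a small sketch using the smoothed formula above):

```python
import numpy as np

# Toy corpus from Step 1: 3 documents in total
N = 3
n_the, n_mat = 3, 1  # "the" occurs in all 3 documents, "mat" in just 1

def idf(n_t):
    # Smoothed IDF: log((1 + N) / (1 + n_t)) + 1
    return np.log((1 + N) / (1 + n_t)) + 1

print(f'idf("the") = {idf(n_the):.3f}')  # 1.000: ubiquitous word, weight shrunk
print(f'idf("mat") = {idf(n_mat):.3f}')  # 1.693: rare word, weight amplified
```

The ubiquitous "the" bottoms out at an IDF of exactly 1, while "mat", which appears in a single document, gets nearly 1.7x its raw count.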

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X)
print(np.round(X_tfidf.toarray(), 2))

TF-IDF heatmap for the toy corpus. Common words like "the" receive low weights, while distinctive words like "chased" and "mat" stand out.
After TF-IDF weighting, the document vectors highlight what's distinctive about each text rather than what's common across all of them.

Step 3: Naive Bayes Classification

Naive Bayes applies Bayes' theorem to classify documents. Given a document with words $w_1, w_2, \ldots, w_n$, it computes:

$$P(c \mid w_1, \ldots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

The "naive" part is the assumption that words are conditionally independent given the class. This is obviously wrong: the word "neural" is far more likely to appear near "network" than near "baseball". But the simplification works remarkably well in practice because:

  1. We only need the ranking right, not the exact probabilities. If $P(\text{sci.space} \mid \text{doc})$ is the highest, the prediction is correct even if the probability value itself is off.
  2. Independence errors tend to cancel out across thousands of features.
  3. The alternative (modelling all word dependencies) is intractable for vocabularies of 130,000 words.

The MultinomialNB variant uses word counts (or TF-IDF weights) as features and models $P(w_i \mid \text{class})$ as a multinomial distribution. The parameters are estimated via maximum likelihood: the probability of word $w_i$ in class $c$ is simply the fraction of times $w_i$ appears in training documents of class $c$, with Laplace smoothing to handle words never seen in training.
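This estimation step can be checked directly against scikit-learn. The sketch below (a toy count matrix with two classes, invented for illustration) reproduces `feature_log_prob_` by hand from the smoothed fraction just described:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: 4 documents, vocabulary of 3 words, 2 classes
X = np.array([[3, 0, 1],   # class 0 documents use word 0 heavily
              [4, 1, 0],
              [0, 5, 1],   # class 1 documents use word 1 heavily
              [1, 4, 2]])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB(alpha=1.0).fit(X, y)

# Smoothed estimate for class 0: (word count + alpha) / (total count + alpha * |V|)
counts = X[y == 0].sum(axis=0)                        # per-word counts in class 0
manual = (counts + 1.0) / (counts.sum() + 1.0 * X.shape[1])

print(np.allclose(np.exp(clf.feature_log_prob_[0]), manual))  # True
```

There is no iterative optimisation at all: training is a single pass of counting and dividing, which is why Naive Bayes trains in seconds even on large corpora.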

The Pipeline: Composing the Steps

Scikit-learn's Pipeline chains these three transformations so you can treat the entire workflow as a single estimator:

text_clf = Pipeline([
    ('vect', CountVectorizer()),     # raw text → word counts
    ('tfidf', TfidfTransformer()),   # word counts → TF-IDF weights
    ('clf', MultinomialNB()),        # TF-IDF vectors → class predictions
])

When you call text_clf.fit(X, y), it runs CountVectorizer.fit_transform(), feeds the output to TfidfTransformer.fit_transform(), then passes the result to MultinomialNB.fit(). At prediction time, the same chain runs in sequence. This also means you can do grid search over any parameter in the pipeline using the double-underscore naming convention (vect__ngram_range, clf__alpha).
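As a sanity check, here is a sketch (toy documents and labels, made up for illustration) showing that the pipeline produces exactly the same predictions as running the three steps by hand:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = ['the cat sat', 'the dog ran', 'cats and dogs']
labels = [0, 1, 1]

# Manual chain: fit_transform each step, then fit the classifier
vect, tfidf, clf = CountVectorizer(), TfidfTransformer(), MultinomialNB()
clf.fit(tfidf.fit_transform(vect.fit_transform(docs)), labels)
manual_pred = clf.predict(tfidf.transform(vect.transform(docs)))

# Pipeline: the same chain, driven by a single fit/predict call
pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])
pipe.fit(docs, labels)

print((pipe.predict(docs) == manual_pred).all())  # True
```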

Going Deeper

Beating the Baseline

Naive Bayes at 77.4% is a strong starting point, but we can improve it in three ways: removing noise (stop words), capturing phrases (bigrams), and tuning the smoothing parameter.

Stop words are common words ("the", "is", "at") that carry little discriminative value. Removing them reduces noise and bumps accuracy from 77.4% to 81.7%:

text_clf_stop = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf_stop.fit(twenty_train.data, twenty_train.target)
print(f'NB + stop words: {accuracy_score(twenty_test.target, text_clf_stop.predict(twenty_test.data)):.1%}')
# NB + stop words: 81.7%

A 4-point gain for one parameter change.

Grid search systematically explores combinations of pipeline parameters. The naming convention (vect__, tfidf__, clf__) lets you reach into any pipeline step:

from sklearn.model_selection import GridSearchCV

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams vs unigrams+bigrams
    'tfidf__use_idf': (True, False),         # use IDF weighting or not
    'clf__alpha': (1e-2, 1e-3),              # smoothing strength
}

gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)
gs_clf.fit(twenty_train.data, twenty_train.target)

print(f'Best CV score: {gs_clf.best_score_:.1%}')
print(f'Best params: {gs_clf.best_params_}')
print(f'Test accuracy: {accuracy_score(twenty_test.target, gs_clf.predict(twenty_test.data)):.1%}')
# Best CV score: 91.6%
# Best params: {'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
# Test accuracy: 83.6%

The best configuration uses bigrams (ngram_range=(1,2)), IDF weighting, and weak smoothing (alpha=0.001). Bigrams capture phrases like "White House" or "hard drive" that individual words miss. The 5-fold CV score (91.6%) is higher than the test accuracy (83.6%) because cross-validation evaluates on data drawn from the same distribution as training, while the test set may contain authors, topics, or writing styles not seen during training.

If you've read our hyperparameter optimisation post, you'll recognise grid search as the brute-force baseline. With only 8 combinations to evaluate here, it's fast enough.

SVM: A Stronger Classifier

Swapping Naive Bayes for a linear SVM (support vector machine) gives a larger improvement than any amount of NB tuning:

from sklearn.linear_model import SGDClassifier

text_clf_svm = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                               alpha=1e-3, max_iter=100,
                               random_state=42)),
])
text_clf_svm.fit(twenty_train.data, twenty_train.target)
print(f'SVM accuracy: {accuracy_score(twenty_test.target, text_clf_svm.predict(twenty_test.data)):.1%}')
# SVM accuracy: 82.4%

That's 82.4% out of the box, without any tuning. Grid search for SVM yields 83.5%, virtually identical to the tuned Naive Bayes.

Accuracy comparison: Naive Bayes baseline (77.4%), NB with stop words (81.7%), SVM baseline (82.4%), NB tuned (83.6%), SVM tuned (83.5%).

The story is clear: the biggest gains come from better feature representation (bigrams, stop word removal, IDF weighting) rather than the choice of classifier. With good features, even the "naive" model performs competitively.

What the Model Actually Learns

What words does the classifier rely on? Raw class-conditional probabilities are dominated by common words like "the" and "of". To find truly discriminative features, we compare each word's log-probability within a class against its average across all classes:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english', max_df=0.9, min_df=5)
X_tfidf = tfidf_vect.fit_transform(twenty_train.data)
clf_disc = MultinomialNB().fit(X_tfidf, twenty_train.target)

feature_names = np.array(tfidf_vect.get_feature_names_out())
log_probs = clf_disc.feature_log_prob_
mean_log_prob = np.mean(log_probs, axis=0)
discriminativeness = log_probs - mean_log_prob

for i, category in enumerate(twenty_train.target_names):
    top_indices = discriminativeness[i].argsort()[-5:][::-1]
    print(f'{category}: {", ".join(feature_names[top_indices])}')

Most discriminative words for four categories: comp.graphics, rec.sport.baseball, sci.space, and talk.politics.mideast.

The model learns sensible patterns. sci.space relies on words like "space", "orbit", and "nasa". rec.sport.baseball relies on "baseball", "team", and "pitching". talk.politics.mideast picks up "israel", "armenian", and "turkish". These are the words that carry the strongest evidence for each category, well beyond their background frequency.

Stemming: Reducing Words to Roots

Stemming maps words to their root form ("running" to "run", "computers" to "comput"). This merges related word forms into a single feature, reducing vocabulary size:

import nltk
from nltk.stem.snowball import SnowballStemmer

nltk.download('punkt', quiet=True)
stemmer = SnowballStemmer('english', ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

text_clf_stemmed = Pipeline([
    ('vect', StemmedCountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB(fit_prior=False)),
])
text_clf_stemmed.fit(twenty_train.data, twenty_train.target)
print(f'NB + stemming + stop words: '
      f'{accuracy_score(twenty_test.target, text_clf_stemmed.predict(twenty_test.data)):.1%}')

Stemming often gives a small additional boost. The code above uses the Snowball stemmer, a refined version of Porter's classic 1980 algorithm that handles irregular forms more gracefully.
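To see how the two stemmers differ, here is a small side-by-side comparison (the word list is chosen purely for illustration):

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

# "fairly" is a classic divergence: Porter keeps the -ly suffix ("fairli"),
# while Snowball strips it ("fair")
for word in ['running', 'computers', 'fairly', 'generously']:
    print(f'{word:>12s}  porter={porter.stem(word):<10s}  snowball={snowball.stem(word)}')
```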

When NOT to Use Bag-of-Words

This approach has clear limitations:

  1. Word order is lost. "Dog bites man" and "man bites dog" produce the same vector. For tasks where order matters (sentiment analysis, textual entailment), you need sequence models or contextual embeddings.
  2. Synonyms are invisible. If test documents use different words for the same concepts, they won't match. Pre-trained embeddings (Word2Vec, BERT) capture semantic similarity.
  3. Short documents suffer. With only a few words, the sparse vector is too noisy for reliable classification. Transformer models handle short texts much better.
  4. Scalability ceiling. As the number of overlapping categories grows, the independence assumption becomes more costly.
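The first limitation is easy to demonstrate:

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
X = vect.fit_transform(['dog bites man', 'man bites dog'])
rows = X.toarray()

# The two documents have identical bag-of-words vectors: word order is gone
print((rows[0] == rows[1]).all())  # True
```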

For many practical applications, TF-IDF with Naive Bayes remains hard to beat when you factor in the ratio of performance to complexity. It trains in seconds, requires no GPU, and produces interpretable results.

Where This Comes From

McCallum & Nigam (1998)

The foundational paper for Naive Bayes text classification is McCallum, A. & Nigam, K. (1998) "A Comparison of Event Models for Naive Bayes Text Classification", presented at the AAAI Workshop on Learning for Text Categorization.

They compared two Naive Bayes variants for text:

  • Multi-variate Bernoulli: each word is a binary feature (present or absent). This is BernoulliNB in scikit-learn.
  • Multinomial: each word is a count feature. This is the MultinomialNB our pipeline uses.

"We find that the multinomial model is almost uniformly superior, especially for large vocabulary sizes."

The multinomial model works better because it uses word frequency information. A document mentioning "baseball" 15 times is stronger evidence for rec.sport.baseball than one mentioning it once. The Bernoulli model discards this frequency signal entirely.
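This difference is easy to see on toy data. In the sketch below (counts invented for illustration), the multinomial model grows more confident in class 0 as the count of its signature word rises, while the Bernoulli model, which binarises counts, cannot tell the two test documents apart:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Training counts: class 0 favours word 0, class 1 favours word 2
X = np.array([[10, 1, 0], [9, 2, 1], [0, 1, 8], [1, 0, 9]])
y = np.array([0, 0, 1, 1])

mnb = MultinomialNB().fit(X, y)
bnb = BernoulliNB().fit(X, y)  # binarises counts internally (binarize=0.0)

# The same test document with word 0 appearing once vs 15 times
doc_once  = np.array([[1, 2, 0]])
doc_often = np.array([[15, 2, 0]])

# Multinomial confidence grows with frequency...
print(mnb.predict_proba(doc_once)[0, 0] < mnb.predict_proba(doc_often)[0, 0])  # True
# ...while Bernoulli assigns identical probabilities to both documents
print((bnb.predict_proba(doc_once) == bnb.predict_proba(doc_often)).all())     # True
```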

Comparison of the two Naive Bayes event models for text: the multivariate Bernoulli model uses binary word presence, while the multinomial model uses word counts, capturing frequency information.

The Multinomial Model

Formally, the predicted class for a document $d$ is:

$$\hat{c} = \operatorname*{arg\,max}_{c}\; P(c) \prod_{i} P(w_i \mid c)^{\,n_i(d)}$$

Where:

  • $P(c)$ is the class prior (fraction of training documents in class $c$)
  • $n_i(d)$ is the count of word $w_i$ in document $d$
  • $P(w_i \mid c)$ is estimated with Laplace smoothing:

$$P(w_i \mid c) = \frac{N_{ic} + \alpha}{N_c + \alpha\,|V|}$$

where $N_{ic}$ is the count of word $w_i$ across training documents of class $c$, $N_c$ is the total word count in class $c$, and $|V|$ is the vocabulary size.

The smoothing parameter $\alpha$ prevents zero probabilities for words that never appeared in a particular class during training. Our grid search found $\alpha = 0.001$ optimal, meaning the model trusts the training data more and smooths less aggressively than the default $\alpha = 1.0$.

TF-IDF: Salton & Buckley (1988)

TF-IDF was formalised by Salton, G. & Buckley, C. (1988) "Term-weighting approaches in automatic text retrieval", Information Processing & Management. The core idea predates this work: Sparck Jones proposed inverse document frequency in 1972.

Scikit-learn's variant uses:

$$\text{idf}(t) = \log\!\frac{1+N}{1+n_t} + 1$$

The "+1" terms prevent division by zero and ensure no word gets zero weight. After computing TF-IDF, each document vector is L2-normalised to unit length.
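The normalisation is easy to verify on a toy corpus (documents invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat on the mat', 'the dog sat on the log']
X = TfidfVectorizer().fit_transform(docs)  # default norm='l2'

# Every document vector has unit Euclidean length after TF-IDF weighting
print(np.allclose(np.linalg.norm(X.toarray(), axis=1), 1.0))  # True
```

Normalising to unit length means a document's weight profile, not its raw length, determines its representation, so long and short posts are directly comparable.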

Historical Context

Text classification has a long lineage:

  • Maron (1961) — First automatic text classification using probabilistic indexing
  • Salton (1971) — The SMART retrieval system, introducing many weighting schemes
  • Sparck Jones (1972) — Inverse document frequency
  • Lewis (1998) — The Reuters benchmark that standardised evaluation
  • Joachims (1998) — Showed SVMs outperform NB on text (our results confirm this: 82.4% vs 77.4%)
  • McCallum & Nigam (1998) — Systematic comparison of NB event models

Today, transformer-based models (BERT, GPT) dominate text classification benchmarks. But TF-IDF with Naive Bayes remains the standard baseline for its speed, interpretability, and surprising competitiveness.

Try It Yourself

The interactive notebook includes exercises:

  1. Subset classification — Use only 4 categories (comp.graphics, rec.sport.baseball, sci.space, talk.politics.mideast). How much does accuracy improve with fewer, more distinct categories?
  2. Feature engineering — Add min_df=5 and max_df=0.5 to CountVectorizer to trim rare and ubiquitous words. How does this affect accuracy and vocabulary size?
  3. Bernoulli vs Multinomial — Replace MultinomialNB with BernoulliNB. Does the McCallum & Nigam finding hold on this dataset?
  4. Beyond bag-of-words — Use TfidfVectorizer with sublinear_tf=True and character n-grams (analyzer='char_wb', ngram_range=(3,5)). Character n-grams capture morphological patterns that word-level features miss.


Frequently Asked Questions

Why is Naive Bayes called "naive"?

The "naive" refers to the conditional independence assumption: the model assumes that each word in a document is independent of every other word, given the class. This is clearly wrong (e.g. "neural" and "network" tend to co-occur), but it works surprisingly well in practice because classification only requires getting the ranking of class probabilities right, not the exact values. Independence errors tend to cancel out across thousands of features.

What is the difference between TF-IDF and raw word counts?

Raw word counts treat all words equally, so common words like "the" and "is" dominate the representation despite carrying no discriminative information. TF-IDF re-weights each word by how rare it is across the entire corpus. Words that appear in many documents get downweighted, while words distinctive to a few documents get amplified. This makes the representation much more informative for classification.

When should I use Naive Bayes instead of a transformer model like BERT?

Naive Bayes with TF-IDF is an excellent choice when you need fast training (seconds, not hours), interpretability (you can inspect which words drive predictions), or when labelled data is limited. It also requires no GPU. For tasks where word order matters (sentiment analysis, entailment) or where you need state-of-the-art accuracy on competitive benchmarks, transformer models will outperform it significantly.

What does the smoothing parameter alpha do in MultinomialNB?

Alpha controls Laplace smoothing, which prevents zero probabilities for words that never appeared in a particular class during training. With alpha = 1.0 (the default), the model adds a pseudocount of 1 to every word-class combination. Smaller values like 0.001 trust the training data more and smooth less aggressively. The optimal value depends on your dataset and can be found through cross-validation.

Why does the model confuse related categories like PC hardware and Mac hardware?

The bag-of-words representation captures which words appear in a document but not the subtle semantic differences between closely related topics. Categories like PC hardware and Mac hardware share a large portion of their vocabulary (words like "drive", "memory", "board", "system"). The model can only distinguish them by the few words unique to each category, which may not always be present in a given document.

Can TF-IDF handle languages other than English?

Yes. TF-IDF is language-agnostic at its core since it operates on tokens, not linguistic structures. However, you may need to adjust tokenisation for languages without clear word boundaries (e.g. Chinese or Japanese) and consider language-specific stop word lists. Stemming and lemmatisation tools are also language-dependent, so you would need appropriate resources for your target language.
