Building a Simple Spam Classifier in Python: A Journey Through NLP Basics and Common Pitfalls

Hey everyone! πŸ‘‹

Today, I want to share my experience building a simple Spam Email Classifier using Python. This project is a fantastic entry point into Machine Learning and Natural Language Processing (NLP), offering practical skills and insights into how real-world spam filters might work. While seemingly straightforward, I hit a few common roadblocks, especially with setting up the NLP environment, and I'll walk you through how I tackled them.

Why a Spam Classifier?

Beyond being a great learning exercise, spam classification is a perfect example of a binary classification problem (spam vs. ham/not spam) with text data. It involves:

  • Data Preprocessing: Cleaning messy, unstructured text.
  • Feature Engineering: Converting text into numerical data that machines can understand.
  • Model Training: Applying classification algorithms.
  • Evaluation: Understanding how well our model performs.

Let's dive in!

The Tools We'll Use

  • Python: The language of choice for ML.
  • pandas: For data loading and manipulation.
  • scikit-learn: The go-to library for machine learning algorithms and utilities.
  • nltk (Natural Language Toolkit): Essential for text preprocessing.
  • Regular Expressions (re): For advanced text pattern matching.

Step 1: Setting Up Your Environment

This is often where the first hurdles appear! I highly recommend using a virtual environment to keep your project dependencies isolated.

# 1. Create your project folder
mkdir spam_classifier_project
cd spam_classifier_project

# 2. Create a virtual environment
python -m venv venv

# 3. Activate the virtual environment
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# 4. Create a requirements.txt file
printf 'pandas>=2.0.0\nscikit-learn>=1.0.0\nnltk>=3.8.0\n' > requirements.txt

# 5. Install dependencies
pip install -r requirements.txt

Step 2: Data Acquisition (and the Dataset Link!)

For this project, I used the SMS Spam Collection Dataset from Kaggle. It's perfectly suited because SMS messages are similar to short emails, making the text processing techniques directly applicable.

  • Dataset Link: SMS Spam Collection Dataset on Kaggle
  • Download: You'll likely need a free Kaggle account to download.
  • Placement: Place the downloaded spam.csv file directly into your spam_classifier_project folder.

Step 3: The Code Breakdown

Here's the full script. I'll highlight key sections and the tricky bits I encountered.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import re
from urllib.error import URLError # Crucial for handling NLTK download errors!

# --- 0. Robust NLTK Data Download ---
# This was my first major hurdle! NLTK needs specific data files (stopwords, tokenizers, etc.).
# Initially, I hit an AttributeError: module 'nltk.downloader' has no attribute 'DownloadError'
# (that class simply doesn't exist). The fix? Catch the exceptions that can actually
# occur: LookupError from nltk.data.find, or URLError if a download fails.

print("--- 0. Checking/Downloading NLTK Data ---")
def safe_nltk_download(package_name):
    try:
        if package_name == 'punkt':
            nltk.data.find(f'tokenizers/{package_name}')
        else:
            nltk.data.find(f'corpora/{package_name}')
        print(f"'{package_name}' already downloaded.")
    except (LookupError, URLError):  # nltk.data.find raises LookupError when the resource is missing
        print(f"'{package_name}' not found. Attempting to download...")
        try:
            nltk.download(package_name)
            print(f"'{package_name}' downloaded successfully.")
        except Exception as e_download:
            print(f"Failed to download '{package_name}': {e_download}")
            print(f"Please try running 'python -c \"import nltk; nltk.download('{package_name}')\"' manually in your terminal.")

safe_nltk_download('stopwords')
safe_nltk_download('punkt')
safe_nltk_download('wordnet')
safe_nltk_download('omw-1.4')
print("--- NLTK Data Check Complete ---")


# --- 1. Dataset Loading ---
print("\n--- Step 1: Dataset Acquisition and Loading ---")
try:
    df = pd.read_csv('spam.csv', encoding='latin-1')
    df = df[['v1', 'v2']]
    df.columns = ['label', 'text']
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("\nError: 'spam.csv' not found. Ensure it's in the same directory.")
    exit()

# --- 2. Exploratory Data Analysis (EDA) ---
# Simple checks for shape, head, label distribution, and missing values.
print("\n--- Step 2: Exploratory Data Analysis (EDA) ---")
print(f"Dataset shape: {df.shape}")
print(df.head())
print("\nLabel distribution:")
print(df['label'].value_counts())
df['label'] = df['label'].map({'ham': 0, 'spam': 1}) # Convert to numerical
print("\nLabels converted (ham=0, spam=1):")
print(df['label'].value_counts())
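
# This dataset is heavily skewed towards ham (roughly 87% ham / 13% spam), which
# is exactly why plain accuracy can look flattering later on. A quick way to
# quantify the imbalance directly:
print("\nClass balance (proportions):")
print(df['label'].value_counts(normalize=True).round(3))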


# --- 3. Text Pre-processing: The NLP Core ---
# This is where we clean the raw text.
print("\n--- Step 3: Text Pre-processing ---")
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower() # Lowercasing
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE) # Remove URLs
    text = re.sub(r'[^\w\s]', '', text) # Remove punctuation
    text = re.sub(r'\d+', '', text) # Remove numbers
    tokens = nltk.word_tokenize(text) # Tokenization
    # Stop word removal & Lemmatization (better than stemming for readability/accuracy)
    cleaned_tokens = [
        lemmatizer.lemmatize(word) for word in tokens if word.isalpha() and word not in stop_words
    ]
    return " ".join(cleaned_tokens)

df['processed_text'] = df['text'].apply(preprocess_text)
print("\nOriginal vs Processed Text (first 3 examples):")
for i in range(3):
    print(f"Original: {df['text'].iloc[i]}")
    print(f"Processed: {df['processed_text'].iloc[i]}\n")


# --- 4. Feature Extraction: TF-IDF ---
# Converting text into numerical vectors is crucial for ML models.
# TF-IDF (Term Frequency-Inverse Document Frequency) gives more weight to unique words.
print("\n--- Step 4: Feature Extraction (TF-IDF) ---")
X = df['processed_text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# `stratify=y` is important for imbalanced datasets (more ham than spam)

tfidf_vectorizer = TfidfVectorizer(max_features=5000, min_df=5, max_df=0.8)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test) # Transform, not fit_transform, for test set!

print(f"TF-IDF transformed training data shape: {X_train_tfidf.shape}")
print(f"TF-IDF transformed test data shape: {X_test_tfidf.shape}")


# --- 5. Model Training: Multinomial Naive Bayes ---
# A simple yet effective classifier for text data.
print("\n--- Step 5: Model Training (Multinomial Naive Bayes) ---")
mnb_classifier = MultinomialNB()
mnb_classifier.fit(X_train_tfidf, y_train)
print("Model training complete.")


# --- 6. Model Evaluation: Beyond Just Accuracy ---
# For imbalanced datasets, precision, recall, and F1-score are more telling.
print("\n--- Step 6: Model Evaluation ---")
y_pred = mnb_classifier.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred) # How many predicted spam were actually spam?
recall = recall_score(y_test, y_pred)       # How many actual spam were caught?
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (for Spam): {precision:.4f}")
print(f"Recall (for Spam): {recall:.4f}")
print(f"F1-Score (for Spam): {f1:.4f}")
print("\nConfusion Matrix:\n", conf_matrix)
print("  [[TN, FP]")
print("   [FN, TP]]")
print(f"  False Positives (Ham identified as spam): {conf_matrix[0,1]}")
print(f"  False Negatives (Spam identified as ham): {conf_matrix[1,0]}")

# --- 7. Test with New Examples ---
print("\n--- Step 7: Testing with New Text Examples ---")
def predict_spam(text, model, vectorizer):
    processed_text = preprocess_text(text)
    vectorized_text = vectorizer.transform([processed_text])
    prediction = model.predict(vectorized_text)[0]
    return "Spam" if prediction == 1 else "Ham"

test_message_1 = "Congratulations! You've won a free Β£1000 cash prize!"
test_message_2 = "Hey, let's meet up for coffee tomorrow at 1 PM."
print(f"'{test_message_1}' -> Predicted: {predict_spam(test_message_1, mnb_classifier, tfidf_vectorizer)}")
print(f"'{test_message_2}' -> Predicted: {predict_spam(test_message_2, mnb_classifier, tfidf_vectorizer)}")

The punkt_tab Mystery (And How I Solved It)

After fixing the DownloadError issue for the standard NLTK packages, I hit another wall:


Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

Attempted to load tokenizers/punkt_tab/english/



This was puzzling because my code explicitly requested 'punkt', not 'punkt_tab'. It turns out that newer NLTK releases load a separate punkt_tab resource when nltk.word_tokenize is called, so downloading 'punkt' alone isn't always enough.

The Solution:

The simplest and most effective solution was to manually ensure both punkt and punkt_tab were downloaded in my virtual environment, then let the script handle the rest. This often resolves underlying path or caching issues NLTK might have.

# Activate your virtual environment first!
source venv/bin/activate  # or .\venv\Scripts\activate on Windows

python -c "import nltk; nltk.download('punkt')"
python -c "import nltk; nltk.download('punkt_tab')"  # download this specific one too!

After doing this, my script ran without a hitch! It seems NLTK can sometimes be a bit particular about its data files.
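
If you'd rather not rely on a manual terminal step, the check-then-download helper from the script can be taught about punkt_tab too. Here's a sketch of that idea (the category strings follow NLTK's tokenizers/ vs corpora/ data layout):

import nltk

# punkt_tab lives under tokenizers/, just like punkt, so route both correctly
TOKENIZER_PACKAGES = {'punkt', 'punkt_tab'}

def safe_nltk_download(package_name):
    category = 'tokenizers' if package_name in TOKENIZER_PACKAGES else 'corpora'
    try:
        nltk.data.find(f'{category}/{package_name}')
        print(f"'{package_name}' already downloaded.")
    except LookupError:  # raised by nltk.data.find when the resource is missing
        print(f"'{package_name}' not found. Attempting to download...")
        nltk.download(package_name)

safe_nltk_download('punkt_tab')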

Final Thoughts

Building this spam classifier was a fantastic learning experience. It reinforced the importance of:

  • Robust Error Handling: Especially when dealing with external resources like NLTK data.
  • Thorough Text Preprocessing: The quality of your features directly impacts model performance.
  • Understanding Evaluation Metrics: Accuracy isn't always enough, especially with imbalanced datasets. Precision and recall give a much clearer picture of how well a spam filter performs.

I hope this post helps anyone else embarking on a similar NLP journey or facing similar NLTK download woes!

What are your experiences with NLP projects? Share your tips and tricks in the comments below!