Building a Simple Spam Classifier in Python: A Journey Through NLP Basics and Common Pitfalls

Hey everyone! πŸ‘‹

Today, I want to share my experience building a simple Spam Email Classifier using Python. This project is a fantastic entry point into Machine Learning and Natural Language Processing (NLP), offering practical skills and insights into how real-world spam filters might work. While seemingly straightforward, I hit a few common roadblocks, especially with setting up the NLP environment, and I'll walk you through how I tackled them.

Why a Spam Classifier?

Beyond being a great learning exercise, spam classification is a perfect example of a binary classification problem (spam vs. ham/not spam) with text data. It involves:

  • Data Preprocessing: Cleaning messy, unstructured text.
  • Feature Engineering: Converting text into numerical data that machines can understand.
  • Model Training: Applying classification algorithms.
  • Evaluation: Understanding how well our model performs.

Let's dive in!

The Tools We'll Use

  • Python: The language of choice for ML.
  • pandas: For data loading and manipulation.
  • scikit-learn: The go-to library for machine learning algorithms and utilities.
  • nltk (Natural Language Toolkit): Essential for text preprocessing.
  • Regular Expressions (re): For advanced text pattern matching.

Step 1: Setting Up Your Environment

This is often where the first hurdles appear! I highly recommend using a virtual environment to keep your project dependencies isolated.

# 1. Create your project folder
mkdir spam_classifier_project
cd spam_classifier_project

# 2. Create a virtual environment
python -m venv venv

# 3. Activate the virtual environment
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# 4. Create a requirements.txt file
printf 'pandas>=2.0.0\nscikit-learn>=1.0.0\nnltk>=3.8.0\n' > requirements.txt

# 5. Install dependencies
pip install -r requirements.txt

Step 2: Data Acquisition (and the Dataset Link!)

For this project, I used the SMS Spam Collection Dataset from Kaggle. It's perfectly suited because SMS messages are similar to short emails, making the text processing techniques directly applicable.

  • Dataset Link: SMS Spam Collection Dataset on Kaggle
  • Download: You'll likely need a free Kaggle account to download.
  • Placement: Place the downloaded spam.csv file directly into your spam_classifier_project folder.

Step 3: The Code Breakdown

Here's the full script. I'll highlight key sections and the tricky bits I encountered.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import re
from urllib.error import URLError # Crucial for handling NLTK download errors!

# --- 0. Robust NLTK Data Download ---
# This was my first major hurdle! NLTK needs specific data files (stopwords, tokenizers, etc.).
# Initially, I hit an AttributeError: module 'nltk.downloader' has no attribute 'DownloadError'
# (that class simply doesn't exist). The fix? Catch the exceptions that can actually
# occur: LookupError from nltk.data.find, or URLError if a download fails.

print("--- 0. Checking/Downloading NLTK Data ---")
def safe_nltk_download(package_name):
    try:
        if package_name == 'punkt':
            nltk.data.find(f'tokenizers/{package_name}')
        else:
            nltk.data.find(f'corpora/{package_name}')
        print(f"'{package_name}' already downloaded.")
    except (LookupError, URLError):  # nltk.data.find raises LookupError when the resource is missing
        print(f"'{package_name}' not found. Attempting to download...")
        try:
            nltk.download(package_name)
            print(f"'{package_name}' downloaded successfully.")
        except Exception as e_download:
            print(f"Failed to download '{package_name}': {e_download}")
            print(f"Please try running 'python -c \"import nltk; nltk.download('{package_name}')\"' manually in your terminal.")

safe_nltk_download('stopwords')
safe_nltk_download('punkt')
safe_nltk_download('wordnet')
safe_nltk_download('omw-1.4')
print("--- NLTK Data Check Complete ---")


# --- 1. Dataset Loading ---
print("\n--- Step 1: Dataset Acquisition and Loading ---")
try:
    df = pd.read_csv('spam.csv', encoding='latin-1')
    df = df[['v1', 'v2']]
    df.columns = ['label', 'text']
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("\nError: 'spam.csv' not found. Ensure it's in the same directory.")
    exit()

# --- 2. Exploratory Data Analysis (EDA) ---
# Simple checks for shape, head, label distribution, and missing values.
print("\n--- Step 2: Exploratory Data Analysis (EDA) ---")
print(f"Dataset shape: {df.shape}")
print(df.head())
print("\nLabel distribution:")
print(df['label'].value_counts())
df['label'] = df['label'].map({'ham': 0, 'spam': 1}) # Convert to numerical
print("\nLabels converted (ham=0, spam=1):")
print(df['label'].value_counts())
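
# This dataset is heavily skewed towards ham (roughly 87% ham / 13% spam), which
# is exactly why plain accuracy can look flattering later on. A quick way to
# quantify the imbalance directly:
print("\nClass balance (proportions):")
print(df['label'].value_counts(normalize=True).round(3))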


# --- 3. Text Pre-processing: The NLP Core ---
# This is where we clean the raw text.
print("\n--- Step 3: Text Pre-processing ---")
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower() # Lowercasing
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE) # Remove URLs
    text = re.sub(r'[^\w\s]', '', text) # Remove punctuation
    text = re.sub(r'\d+', '', text) # Remove numbers
    tokens = nltk.word_tokenize(text) # Tokenization
    # Stop word removal & Lemmatization (better than stemming for readability/accuracy)
    cleaned_tokens = [
        lemmatizer.lemmatize(word) for word in tokens if word.isalpha() and word not in stop_words
    ]
    return " ".join(cleaned_tokens)

df['processed_text'] = df['text'].apply(preprocess_text)
print("\nOriginal vs Processed Text (first 3 examples):")
for i in range(3):
    print(f"Original: {df['text'].iloc[i]}")
    print(f"Processed: {df['processed_text'].iloc[i]}\n")


# --- 4. Feature Extraction: TF-IDF ---
# Converting text into numerical vectors is crucial for ML models.
# TF-IDF (Term Frequency-Inverse Document Frequency) gives more weight to unique words.
print("\n--- Step 4: Feature Extraction (TF-IDF) ---")
X = df['processed_text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# `stratify=y` is important for imbalanced datasets (more ham than spam)

tfidf_vectorizer = TfidfVectorizer(max_features=5000, min_df=5, max_df=0.8)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test) # Transform, not fit_transform, for test set!

print(f"TF-IDF transformed training data shape: {X_train_tfidf.shape}")
print(f"TF-IDF transformed test data shape: {X_test_tfidf.shape}")


# --- 5. Model Training: Multinomial Naive Bayes ---
# A simple yet effective classifier for text data.
print("\n--- Step 5: Model Training (Multinomial Naive Bayes) ---")
mnb_classifier = MultinomialNB()
mnb_classifier.fit(X_train_tfidf, y_train)
print("Model training complete.")


# --- 6. Model Evaluation: Beyond Just Accuracy ---
# For imbalanced datasets, precision, recall, and F1-score are more telling.
print("\n--- Step 6: Model Evaluation ---")
y_pred = mnb_classifier.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred) # How many predicted spam were actually spam?
recall = recall_score(y_test, y_pred)       # How many actual spam were caught?
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (for Spam): {precision:.4f}")
print(f"Recall (for Spam): {recall:.4f}")
print(f"F1-Score (for Spam): {f1:.4f}")
print("\nConfusion Matrix:\n", conf_matrix)
print("  [[TN, FP]")
print("   [FN, TP]]")
print(f"  False Positives (Ham identified as spam): {conf_matrix[0,1]}")
print(f"  False Negatives (Spam identified as ham): {conf_matrix[1,0]}")

# --- 7. Test with New Examples ---
print("\n--- Step 7: Testing with New Text Examples ---")
def predict_spam(text, model, vectorizer):
    processed_text = preprocess_text(text)
    vectorized_text = vectorizer.transform([processed_text])
    prediction = model.predict(vectorized_text)[0]
    return "Spam" if prediction == 1 else "Ham"

test_message_1 = "Congratulations! You've won a free Β£1000 cash prize!"
test_message_2 = "Hey, let's meet up for coffee tomorrow at 1 PM."
print(f"'{test_message_1}' -> Predicted: {predict_spam(test_message_1, mnb_classifier, tfidf_vectorizer)}")
print(f"'{test_message_2}' -> Predicted: {predict_spam(test_message_2, mnb_classifier, tfidf_vectorizer)}")

The punkt_tab Mystery (And How I Solved It)

After fixing the DownloadError issue for the standard NLTK packages, I hit another wall:


Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

Attempted to load tokenizers/punkt_tab/english/



This was puzzling because my code explicitly requested 'punkt', not 'punkt_tab'. It turns out that newer NLTK releases load a separate punkt_tab resource when nltk.word_tokenize is called, so downloading 'punkt' alone isn't always enough.

The Solution:

The simplest and most effective solution was to manually ensure both punkt and punkt_tab were downloaded in my virtual environment, then let the script handle the rest. This often resolves underlying path or caching issues NLTK might have.

# Activate your virtual environment first!
source venv/bin/activate  # or .\venv\Scripts\activate on Windows

python -c "import nltk; nltk.download('punkt')"
python -c "import nltk; nltk.download('punkt_tab')"  # download this specific one too!

After doing this, my script ran without a hitch! It seems NLTK can sometimes be a bit particular about its data files.
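
If you'd rather not rely on a manual terminal step, the check-then-download helper from the script can be taught about punkt_tab too. Here's a sketch of that idea (the category strings follow NLTK's tokenizers/ vs corpora/ data layout):

import nltk

# punkt_tab lives under tokenizers/, just like punkt, so route both correctly
TOKENIZER_PACKAGES = {'punkt', 'punkt_tab'}

def safe_nltk_download(package_name):
    category = 'tokenizers' if package_name in TOKENIZER_PACKAGES else 'corpora'
    try:
        nltk.data.find(f'{category}/{package_name}')
        print(f"'{package_name}' already downloaded.")
    except LookupError:  # raised by nltk.data.find when the resource is missing
        print(f"'{package_name}' not found. Attempting to download...")
        nltk.download(package_name)

safe_nltk_download('punkt_tab')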

Final Thoughts

Building this spam classifier was a fantastic learning experience. It reinforced the importance of:

  • Robust Error Handling: Especially when dealing with external resources like NLTK data.
  • Thorough Text Preprocessing: The quality of your features directly impacts model performance.
  • Understanding Evaluation Metrics: Accuracy isn't always enough, especially with imbalanced datasets. Precision and recall give a much clearer picture of how well a spam filter performs.

I hope this post helps anyone else embarking on a similar NLP journey or facing similar NLTK download woes!

What are your experiences with NLP projects? Share your tips and tricks in the comments below!