Hey everyone! 👋
Today, I want to share my experience building a simple Spam Email Classifier using Python. This project is a fantastic entry point into Machine Learning and Natural Language Processing (NLP), offering practical skills and insights into how real-world spam filters might work. While seemingly straightforward, I hit a few common roadblocks, especially with setting up the NLP environment, and I'll walk you through how I tackled them.
### Why a Spam Classifier?
Beyond being a great learning exercise, spam classification is a perfect example of a binary classification problem (spam vs. ham/not spam) with text data. It involves:
- Data Preprocessing: Cleaning messy, unstructured text.
- Feature Engineering: Converting text into numerical data that machines can understand.
- Model Training: Applying classification algorithms.
- Evaluation: Understanding how well our model performs.
Let's dive in!
### The Tools We'll Use
- Python: The language of choice for ML.
- `pandas`: For data loading and manipulation.
- `scikit-learn`: The go-to library for machine learning algorithms and utilities.
- `nltk` (Natural Language Toolkit): Essential for text preprocessing.
- Regular expressions (`re`): For text pattern matching.
### Step 1: Setting Up Your Environment
This is often where the first hurdles appear! I highly recommend using a virtual environment to keep your project dependencies isolated.
```bash
# 1. Create your project folder
mkdir spam_classifier_project
cd spam_classifier_project

# 2. Create a virtual environment
python -m venv venv

# 3. Activate the virtual environment
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# 4. Create a requirements.txt file with:
# pandas>=2.0.0
# scikit-learn>=1.0.0
# nltk>=3.8.0

# 5. Install dependencies
pip install -r requirements.txt
```
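Before moving on, it's worth confirming that everything landed in the right environment. A quick sanity check (just a throwaway snippet, not part of the project):

```python
# Sanity check: the three main libraries import cleanly and report their versions.
import pandas
import sklearn
import nltk

print("pandas:", pandas.__version__)
print("scikit-learn:", sklearn.__version__)
print("nltk:", nltk.__version__)
```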
### Step 2: Data Acquisition (and the Dataset Link!)
For this project, I used the SMS Spam Collection Dataset from Kaggle. It's perfectly suited because SMS messages are similar to short emails, making the text processing techniques directly applicable.
- Dataset link: SMS Spam Collection Dataset on Kaggle
- Download: You'll likely need a free Kaggle account to download it.
- Placement: Put the downloaded `spam.csv` file directly into your `spam_classifier_project` folder.
### Step 3: The Code Breakdown
Here's the full script, section by section. I'll highlight the key parts and the tricky bits I encountered; the short snippets between sections are optional asides, not part of the script itself.
```python
import re

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# --- 0. Robust NLTK Data Download ---
# This was my first major hurdle! NLTK needs specific data files (like stopwords and tokenizers).
# Initially, I hit an "AttributeError: module 'nltk.downloader' has no attribute 'DownloadError'"
# because I was trying to catch an exception class that doesn't exist. The fix: nltk.data.find()
# raises a plain LookupError when a resource is missing, and a broad Exception around the
# download itself covers network errors such as URLError.
print("--- 0. Checking/Downloading NLTK Data ---")

def safe_nltk_download(package_name):
    # Tokenizer data (punkt and friends) lives under 'tokenizers/'; the rest is under 'corpora/'.
    subdir = 'tokenizers' if package_name.startswith('punkt') else 'corpora'
    try:
        nltk.data.find(f'{subdir}/{package_name}')
        print(f"'{package_name}' already downloaded.")
    except LookupError:
        print(f"'{package_name}' not found. Attempting to download...")
        try:
            nltk.download(package_name)
            print(f"'{package_name}' downloaded successfully.")
        except Exception as e_download:
            print(f"Failed to download '{package_name}': {e_download}")
            print(f"Please try running 'python -c \"import nltk; nltk.download('{package_name}')\"' manually in your terminal.")

safe_nltk_download('stopwords')
safe_nltk_download('punkt')
safe_nltk_download('wordnet')
safe_nltk_download('omw-1.4')
print("--- NLTK Data Check Complete ---")
```
```python
# --- 1. Dataset Loading ---
print("\n--- Step 1: Dataset Acquisition and Loading ---")
try:
    # The Kaggle CSV isn't UTF-8 encoded; latin-1 avoids a UnicodeDecodeError.
    df = pd.read_csv('spam.csv', encoding='latin-1')
    df = df[['v1', 'v2']]  # keep only the label and message columns
    df.columns = ['label', 'text']
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("\nError: 'spam.csv' not found. Ensure it's in the same directory.")
    exit()

# --- 2. Exploratory Data Analysis (EDA) ---
# Simple checks for shape, head, label distribution, and missing values.
print("\n--- Step 2: Exploratory Data Analysis (EDA) ---")
print(f"Dataset shape: {df.shape}")
print(df.head())
print("\nLabel distribution:")
print(df['label'].value_counts())

df['label'] = df['label'].map({'ham': 0, 'spam': 1})  # convert labels to numbers
print("\nLabels converted (ham=0, spam=1):")
print(df['label'].value_counts())
```
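One number worth pulling out of the EDA is the class balance, since it motivates the metric choices later on. On this dataset, spam is only around 13% of messages:

```python
# Fraction of messages labeled spam (labels are already 0/1 at this point).
spam_fraction = df['label'].mean()
print(f"Spam fraction: {spam_fraction:.2%}")
```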
```python
# --- 3. Text Pre-processing: The NLP Core ---
# This is where we clean the raw text.
print("\n--- Step 3: Text Pre-processing ---")
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()                                                      # lowercasing
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)  # remove URLs
    text = re.sub(r'[^\w\s]', '', text)                                      # remove punctuation
    text = re.sub(r'\d+', '', text)                                          # remove numbers
    tokens = nltk.word_tokenize(text)                                        # tokenization
    # Stop word removal & lemmatization (better than stemming for readability/accuracy)
    cleaned_tokens = [
        lemmatizer.lemmatize(word) for word in tokens
        if word.isalpha() and word not in stop_words
    ]
    return " ".join(cleaned_tokens)

df['processed_text'] = df['text'].apply(preprocess_text)
print("\nOriginal vs Processed Text (first 3 examples):")
for i in range(3):
    print(f"Original: {df['text'].iloc[i]}")
    print(f"Processed: {df['processed_text'].iloc[i]}\n")
```
```python
# --- 4. Feature Extraction: TF-IDF ---
# Converting text into numerical vectors is crucial for ML models.
# TF-IDF (Term Frequency-Inverse Document Frequency) gives more weight to words
# that are distinctive to a message rather than common everywhere.
print("\n--- Step 4: Feature Extraction (TF-IDF) ---")
X = df['processed_text']
y = df['label']
# `stratify=y` keeps the ham/spam ratio the same in both splits -- important for imbalanced data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

tfidf_vectorizer = TfidfVectorizer(max_features=5000, min_df=5, max_df=0.8)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)  # transform, not fit_transform, for the test set!
print(f"TF-IDF transformed training data shape: {X_train_tfidf.shape}")
print(f"TF-IDF transformed test data shape: {X_test_tfidf.shape}")
```
```python
# --- 5. Model Training: Multinomial Naive Bayes ---
# A simple yet effective classifier for text data.
print("\n--- Step 5: Model Training (Multinomial Naive Bayes) ---")
mnb_classifier = MultinomialNB()
mnb_classifier.fit(X_train_tfidf, y_train)
print("Model training complete.")
```
```python
# --- 6. Model Evaluation: Beyond Just Accuracy ---
# For imbalanced datasets, precision, recall, and F1-score are more telling than accuracy alone.
print("\n--- Step 6: Model Evaluation ---")
y_pred = mnb_classifier.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)  # of the messages predicted spam, how many really were?
recall = recall_score(y_test, y_pred)        # of the actual spam, how much did we catch?
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (for Spam): {precision:.4f}")
print(f"Recall (for Spam): {recall:.4f}")
print(f"F1-Score (for Spam): {f1:.4f}")
print("\nConfusion Matrix:\n", conf_matrix)
print("  [[TN, FP]")
print("   [FN, TP]]")
print(f"  False Positives (ham flagged as spam): {conf_matrix[0, 1]}")
print(f"  False Negatives (spam that slipped through as ham): {conf_matrix[1, 0]}")
```
```python
# --- 7. Test with New Examples ---
print("\n--- Step 7: Testing with New Text Examples ---")

def predict_spam(text, model, vectorizer):
    # Apply the exact same preprocessing and fitted vectorizer used in training.
    processed_text = preprocess_text(text)
    vectorized_text = vectorizer.transform([processed_text])
    prediction = model.predict(vectorized_text)[0]
    return "Spam" if prediction == 1 else "Ham"

test_message_1 = "Congratulations! You've won a free £1000 cash prize!"
test_message_2 = "Hey, let's meet up for coffee tomorrow at 1 PM."
print(f"'{test_message_1}' -> Predicted: {predict_spam(test_message_1, mnb_classifier, tfidf_vectorizer)}")
print(f"'{test_message_2}' -> Predicted: {predict_spam(test_message_2, mnb_classifier, tfidf_vectorizer)}")
```
### The `punkt_tab` Mystery (And How I Solved It)
After fixing the `DownloadError` for standard NLTK packages, I hit another wall:
```
Please use the NLTK Downloader to obtain the resource:

    import nltk
    nltk.download('punkt_tab')

Attempted to load tokenizers/punkt_tab/english/
```
This was puzzling because my code explicitly requested `'punkt'`, not `'punkt_tab'`. It turns out that newer NLTK releases load a separate `punkt_tab` resource when `nltk.word_tokenize` is called, so downloading `punkt` alone isn't always enough.
**The Solution:**
The simplest and most effective solution was to **manually ensure both `punkt` and `punkt_tab` were downloaded** in my virtual environment, then let the script handle the rest. This often resolves underlying path or caching issues NLTK might have.
```bash
# Activate your virtual environment first!
source venv/bin/activate  # or .\venv\Scripts\activate on Windows
python -c "import nltk; nltk.download('punkt')"
python -c "import nltk; nltk.download('punkt_tab')"  # download this specific one too!
```
After doing this, my script ran without a hitch! It seems sometimes NLTK can be a bit particular about its data files.
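If you'd rather have the script handle this itself, you can also add `punkt_tab` to its download list. It lives under `tokenizers/`, just like `punkt`, so the `safe_nltk_download` helper from earlier covers it:

```python
safe_nltk_download('punkt_tab')
```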
### Final Thoughts
Building this spam classifier was a fantastic learning experience. It reinforced the importance of:
* **Robust Error Handling:** Especially when dealing with external resources like NLTK data.
* **Thorough Text Preprocessing:** The quality of your features directly impacts model performance.
* **Understanding Evaluation Metrics:** Accuracy isn't always enough, especially with imbalanced datasets. Precision and Recall give a much clearer picture of how well a spam filter performs.
I hope this post helps anyone else embarking on a similar NLP journey or facing similar NLTK download woes!
What are your experiences with NLP projects? Share your tips and tricks in the comments below!