This guide details an end-to-end Machine Learning pipeline for email spam classification, covering text preprocessing, comparative evaluations between Naive Bayes and fine-tuned RoBERTa models, interactive visualization with Streamlit, and deployment using Docker.
Index
- Introduction and Overview
- Dataset Ingestion and Preprocessing
- Vocabulary Building and Filtering
- Feature Extraction and Engineering
- Model Training and Serialization
- Comparative Analysis: Naive Bayes vs. Fine-Tuned RoBERTa Models
- Interactive Streamlit Interface
- Docker Containerization and Deployment
- Project Repository and Resources
Introduction and Overview
Email spam detection is a classic text classification problem in Machine Learning. The objective is to automatically classify incoming emails as either Spam (unsolicited bulk messages) or Ham (legitimate personal or professional messages).
While this project trains a custom Multinomial Naive Bayes classifier, a classical bag-of-words method well suited to word-frequency feature matrices, it also integrates and evaluates pre-trained Transformer-based RoBERTa models from Hugging Face (specifically dima806/email-spam-detection-roberta and roshana1s/spam-message-classifier). This comparison highlights the difference between processing raw text contextually and counting word frequencies.
The architecture of the project is divided into three main components:
- Model Training Pipeline: A training script that cleans the dataset, constructs a vocabulary of the most common words, generates bag-of-words features, trains the Naive Bayes classifier, and serializes the model artifacts.
- Interactive User Interface: A Streamlit web application that loads the saved Naive Bayes artifacts along with the fine-tuned RoBERTa model to provide a user-friendly side-by-side inference interface.
- Deployment: A containerized environment, defined in a Dockerfile, that packages the application and its dependencies to run on CPU-only or GPU-accelerated hosts.
System Architecture Diagram
Dataset Ingestion and Preprocessing
The project starts by consolidating multiple raw datasets into a single clean dataset using a dedicated preprocessing script. This ensures uniform text formatting and standardized labeling across different data sources.
1. Processing Raw Datasets
Three different raw spam and ham email collections are loaded and cleaned:
- For the first two datasets, we extract the label and text columns, replace multiple spaces with a single space, strip leading/trailing whitespace, lowercase the text, and drop null values.
- For the third dataset, we additionally strip the case-insensitive prefix Subject: (or subject:) from the text and convert the numeric labels (0 to ham, and all other values to spam).
import pandas as pd
# Load and clean the first dataset
with open('Dataset/spam.csv', 'r', encoding='utf-8', errors='ignore') as f:
    df1 = pd.read_csv(f, usecols=[0, 1], names=['label', 'text'], header=0)
df1['text'] = df1['text'].str.replace(r'\s+', ' ', regex=True).str.strip().str.lower()
df1 = df1[['text', 'label']].dropna()
# Load and clean the second dataset
with open('Dataset/spam1.csv', 'r', encoding='utf-8', errors='ignore') as f:
    df2 = pd.read_csv(f, usecols=[0, 1], names=['label', 'text'], header=0)
df2['text'] = df2['text'].str.replace(r'\s+', ' ', regex=True).str.strip().str.lower()
df2 = df2[['text', 'label']].dropna()
# Load and clean the third dataset (specifically stripping the 'Subject:' prefix)
with open('Dataset/emails.csv', 'r', encoding='utf-8', errors='ignore') as f:
    df3 = pd.read_csv(f, usecols=[0, 1], names=['text', 'label'], header=0)
df3['text'] = (df3['text']
               .str.replace(r'(?i)^\s*subject[:\s]+', '', regex=True)
               .str.replace(r'\s+', ' ', regex=True)
               .str.strip()
               .str.lower())
df3 = df3.dropna()
# Map numeric labels to strings: 0 becomes 'ham', everything else 'spam'
df3['label'] = df3['label'].apply(lambda x: 'ham' if x == 0 else 'spam')
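The cleaning steps above can also be written as one small function, which makes them easy to check on sample strings. The sketch below is illustrative only: `clean_text` is a hypothetical helper that is not part of the project, and it uses the standard `re` module rather than the pandas string methods.

```python
import re

def clean_text(text: str) -> str:
    """Apply the same normalization as the pandas pipeline to a single string."""
    # Strip a leading "Subject:" prefix, case-insensitively
    text = re.sub(r'(?i)^\s*subject[:\s]+', '', text)
    # Collapse runs of whitespace, trim both ends, and lowercase
    return re.sub(r'\s+', ' ', text).strip().lower()

print(clean_text("Subject:  WIN a   FREE prize\n now "))  # win a free prize now
print(clean_text("subject to change"))                     # to change
```

The second call shows a side effect of the pattern: because `[:\s]+` also accepts plain whitespace, a message that merely begins with the word "subject" loses that word as well.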
2. Combining the Datasets
Once cleaned, we concatenate all three processed datasets and save the output as a unified CSV file:
combined_df = pd.concat([df1, df2, df3], ignore_index=True)
combined_df.to_csv('Dataset/combined_spam_new.csv', index=False)
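Before training on the merged file, it can be worth inspecting the result of the concatenation. The following is a minimal sketch on toy frames (the rows are invented for illustration). Dropping duplicate texts is an optional extra step that the original script does not perform.

```python
import pandas as pd

# Toy stand-ins for the cleaned frames (invented rows, same columns as above)
df_a = pd.DataFrame({'text': ['free prize now', 'see you at lunch'],
                     'label': ['spam', 'ham']})
df_b = pd.DataFrame({'text': ['free prize now', 'meeting moved to 3pm'],
                     'label': ['spam', 'ham']})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 0, 1
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined['label'].value_counts())  # class balance

# Optional: remove messages that appear in more than one source
deduped = combined.drop_duplicates(subset='text').reset_index(drop=True)
print(len(combined), len(deduped))       # 4 3
```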
Vocabulary Building and Filtering
We load the combined dataset, clean the raw texts, and build a dictionary of words that will serve as features for our machine learning model.
1. Ingestion and Cleaning
We load the dataset, drop missing rows, and tokenize the texts by splitting them on spaces. To clean the vocabulary, we retain only alphabetic words:
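import pandas as pd
# Load the combined dataset written by the preprocessing script
data = pd.read_csv("Dataset/combined_spam_new.csv")
data_clean = data.dropna()
# Tokenize on single spaces and collect every token
words = []
for row in data_clean['text']:
    words += row.split(" ")
# Blank out tokens that are not purely alphabetic (numbers, punctuation, mixed tokens)
for i in range(len(words)):
    if not words[i].isalpha():
        words[i] = ""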
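To see what this filter keeps and discards, here is a small sketch on an invented sample string. It uses the same `split(" ")` and `isalpha()` calls as the training script.

```python
sample = "win $100 now!!! call 555 today"

tokens = sample.split(" ")
kept = [t for t in tokens if t.isalpha()]

print(tokens)  # ['win', '$100', 'now!!!', 'call', '555', 'today']
print(kept)    # ['win', 'call', 'today']
```

Note that a word with attached punctuation, such as `now!!!`, is dropped entirely rather than reduced to `now`, so the vocabulary only contains words that appeared as standalone alphabetic tokens.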
2. Frequency Filtering
Using a Counter object, we count word occurrences, remove empty token placeholders, and extract the 3,000 most common words. This vocabulary acts as our feature set:
from collections import Counter
word_dict = Counter(words)
del word_dict['']
word_dict = word_dict.most_common(3000)
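The following toy example (invented tokens, top 2 instead of top 3,000) shows the shape of the result. After `most_common`, the vocabulary is no longer a `Counter` but a list of `(word, count)` tuples, which is why later code reads the word with `word[0]`.

```python
from collections import Counter

tokens = ["free", "win", "free", "", "hello", "win", "free"]

counts = Counter(tokens)
del counts[""]                 # Counter does not raise if the key is missing
top = counts.most_common(2)    # list of (word, count) tuples, most frequent first
print(top)                     # [('free', 3), ('win', 2)]

vocabulary = [word for word, _ in top]
print(vocabulary)              # ['free', 'win']
```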
Feature Extraction and Engineering
With our 3,000-word vocabulary established, we map each email text to a numerical vector using a Bag-of-Words (BoW) representation.
1. Building the Feature Matrix
For each email, we create a count vector where each element represents the number of times a word from our 3,000-word vocabulary appears in the email.
import numpy as np
feature_matrix = []
labels = []
for text, label in data_clean[['text', 'label']].values:
    # Count how often each vocabulary word occurs in this email
    data_count = []
    row_words = text.split(" ")
    for word in word_dict:
        data_count.append(row_words.count(word[0]))
    feature_matrix.append(data_count)
    # Encode the label: 1 for spam, 0 for ham
    if 'spam' in label:
        labels.append(1)
    elif 'ham' in label:
        labels.append(0)
feature_matrix = np.array(feature_matrix)
labels = np.array(labels)
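Calling `list.count` once per vocabulary word scans each email 3,000 times. As a sketch of an alternative that yields the same vector, the tokens can be counted once with a `Counter` and then looked up per vocabulary word. The `vectorize` helper and the three-word vocabulary are hypothetical and only serve as illustration.

```python
from collections import Counter
import numpy as np

def vectorize(text, vocabulary):
    """Return the bag-of-words count vector for one text."""
    counts = Counter(text.split(" "))   # one pass over the tokens
    # A Counter returns 0 for words that never occurred
    return np.array([counts[w] for w in vocabulary])

vocabulary = ["free", "win", "hello"]   # toy vocabulary
vec = vectorize("free free win now", vocabulary)
print(vec)  # [2 1 0]
```

The word `now` is ignored because it is not in the vocabulary, which is exactly how out-of-vocabulary words are treated in the feature matrix above.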
Model Training and Serialization
Once the feature matrix and label arrays are generated, we split the data into training and test sets, train a Naive Bayes classifier, and serialize the trained model.
1. Splitting and Training
We use train_test_split from scikit-learn to split the dataset, reserving 20% of the data for testing. We then fit a MultinomialNB model.
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Hold out 20% of the samples for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    feature_matrix, labels, test_size=0.2, random_state=9
)
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
# scikit-learn metrics expect the true labels first, then the predictions
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")
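Accuracy alone does not show how many legitimate emails are wrongly flagged as spam. As a hedged illustration, the sketch below fits `MultinomialNB` on a tiny invented count matrix and reports a confusion matrix together with precision and recall. The data is made up and the numbers say nothing about the project's real model.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy count matrix: the two columns stand for two invented vocabulary words
X = np.array([[2, 0], [3, 0], [0, 2], [0, 3]])
y = np.array([1, 1, 0, 0])                     # 1 = spam, 0 = ham

clf = MultinomialNB().fit(X, y)

X_new = np.array([[1, 0], [0, 1], [1, 0]])
y_true = np.array([1, 0, 0])                   # the last sample is ham but looks like spam
y_hat = clf.predict(X_new)

# Rows are true classes, columns are predicted classes (order: ham, spam)
print(confusion_matrix(y_true, y_hat))
print("precision:", precision_score(y_true, y_hat))
print("recall:   ", recall_score(y_true, y_hat))
```

For the project's model, the same three metric calls can be applied to `y_test` and `y_pred` from the training script.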
2. Serialization
The vocabulary (a list of (word, count) pairs) and the trained classifier are saved using Python's pickle module so that they can be loaded by the Streamlit application.
import os
import pickle
os.makedirs('models', exist_ok=True)
with open('models/word_dict.pkl', 'wb') as f:
    pickle.dump(word_dict, f, protocol=pickle.HIGHEST_PROTOCOL)
with open('models/nb_classifier.pkl', 'wb') as f:
    pickle.dump(classifier, f, protocol=pickle.HIGHEST_PROTOCOL)
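A quick round trip confirms that what is written can be read back unchanged. The sketch below uses a toy vocabulary and a temporary directory instead of the project's `models` folder.

```python
import os
import pickle
import tempfile

vocab = [("free", 3), ("win", 2)]   # toy stand-in for word_dict

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "word_dict.pkl")
    with open(path, "wb") as f:
        pickle.dump(vocab, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == vocab)  # True
```

Since unpickling can execute arbitrary code, pickle files should only be loaded from trusted sources. For the classifier, it is also safest to load the file with the same scikit-learn version that produced it.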
Comparative Analysis: Naive Bayes vs. Fine-Tuned RoBERTa Models
In addition to the custom-trained Naive Bayes model, the project evaluates two pre-trained deep-learning Transformer models using the Hugging Face pipeline interface to compare performance and explore model edge cases.
1. Fine-Tuned Transformer Models Evaluated
- Model 1: dima806/email-spam-detection-roberta (Accuracy on test set: 79.58%)
- Model 2: roshana1s/spam-message-classifier (Accuracy on test set: 84.45%)
Both models are loaded in CPU-only mode:
import os
# Hide CUDA devices so that everything runs on the CPU (set before importing transformers)
os.environ['CUDA_VISIBLE_DEVICES'] = ''
from transformers import pipeline
# Load fine-tuned models (device=-1 selects the CPU)
spam_roberta_1 = pipeline("text-classification", model="dima806/email-spam-detection-roberta", device=-1)
spam_roberta_2 = pipeline("text-classification", model="roshana1s/spam-message-classifier", device=-1)
# Run inference on sample email text (truncating long sequences to 512 tokens)
text_sample = "Your email text goes here..."
result_1 = spam_roberta_1(text_sample, truncation=True, max_length=512)
result_2 = spam_roberta_2(text_sample, truncation=True, max_length=512)
# Each result is a list with one dict per input, holding a 'label' and a 'score'
print("Model 1 Result:", result_1)
print("Model 2 Result:", result_2)
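The article does not show how the test-set accuracies were measured. One plausible way is to run a pipeline over labelled texts and count matching labels, as sketched below. To keep the example runnable without downloading a model, `fake_pipeline` is a hypothetical stand-in that mimics the output shape of a text-classification pipeline. The label strings "spam" and "ham" are also an assumption, since real models may report differently named labels.

```python
def fake_pipeline(text, **kwargs):
    """Stand-in for a Hugging Face pipeline: returns [{'label': ..., 'score': ...}]."""
    label = "spam" if "free" in text.lower() else "ham"
    return [{"label": label, "score": 0.9}]

def accuracy(classify, texts, labels):
    """Fraction of texts whose predicted label matches the gold label."""
    correct = 0
    for text, gold in zip(texts, labels):
        pred = classify(text, truncation=True, max_length=512)[0]["label"].lower()
        correct += int(pred == gold)
    return correct / len(texts)

# Invented evaluation samples; the third one fools the keyword-based stand-in
texts = ["Free tickets, click here", "Lunch at noon?", "Your free invoice copy"]
labels = ["spam", "ham", "ham"]
print(f"accuracy: {accuracy(fake_pipeline, texts, labels):.2f}")
```

Passing `spam_roberta_1` or `spam_roberta_2` instead of `fake_pipeline` would evaluate the real models, provided their label names are mapped to the gold labels.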
2. Sincere Text Evaluation & Model Selection
While Model 2 has a higher overall test set accuracy (84.45% compared to 79.58%), evaluating these models on qualitative edge cases reveals significant behavior differences.
A test was conducted using a highly romantic, sincere declaration of love as input text:
"I love you! You are the best person in the world. I am so happy to have you in my life. You are my sunshine and my everything. I will always love you and be there for you. You are my soulmate and my best friend. I am so grateful to have you in my life. I love you more than words can express. You are the love of my life and I will always cherish you. I am so lucky to have you as my partner. I love you with all my heart and soul. You are the most amazing person I have ever met and I am so blessed to have you in my life."
Running this sample text through the classifiers yielded:
- Naive Bayes (Bag-of-Words): Predicted as Spam
- Model 1 (dima806): Predicted as Spam with 97.15% confidence
- Model 2 (roshana1s): Predicted as Ham with 73.02% confidence
Why Model 2 is Preferred
The Naive Bayes model and the first RoBERTa model mistakenly flag this heartfelt email as spam, apparently because they have learned to associate words like "cherish", "partner", "love", and "best" with spam. In a real-world deployment, this would filter out genuine personal messages or declarations of affection.
Model 2 is the only model that successfully processes the contextual meaning of the message and classifies it as Ham. Because of this contextual robustness and its higher overall test accuracy, Model 2 is strongly preferred over the other classifiers.
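For the Naive Bayes model, word-level associations of this kind can be inspected directly, because a fitted `MultinomialNB` exposes per-class log probabilities in `feature_log_prob_`. The sketch below computes a spam-versus-ham log-odds score per word on invented toy data. The vocabulary and counts are made up and do not reflect the project's trained model.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

vocab = ["love", "free", "meeting"]            # toy vocabulary (invented)
X = np.array([[0, 3, 0], [1, 2, 0], [2, 0, 1], [0, 0, 3]])
y = np.array([1, 1, 0, 0])                     # 1 = spam, 0 = ham

clf = MultinomialNB().fit(X, y)

# Rows of feature_log_prob_ follow clf.classes_ (here [0, 1]).
# Positive log-odds lean towards spam, negative towards ham.
log_odds = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
for word, score in sorted(zip(vocab, log_odds), key=lambda pair: -pair[1]):
    print(f"{word:8s} {score:+.2f}")
```

With the project's artifacts, `vocab` would be the word list from `word_dict.pkl` and `clf` the unpickled classifier. Sorting the scores for the words of a misclassified email shows which tokens pushed it towards spam.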
Interactive Streamlit Interface
The interactive Streamlit application wraps the Naive Bayes classifier and the fine-tuned RoBERTa model in a single web dashboard that shows their spam predictions side by side.
1. Model Loading and Feature Caching
To optimize performance, we use Streamlit's @st.cache_resource and @st.cache_data decorators. The Naive Bayes classifier, vocabulary, and RoBERTa pipeline are therefore loaded only once, and the bag-of-words vectorization results are cached per input. The app is switched to a wide layout with st.set_page_config(layout="wide"):
import pickle
import numpy as np
import streamlit as st
from transformers import pipeline
st.set_page_config(layout="wide")
WORD_DICT_PATH = 'models/word_dict.pkl'
CLASSIFIER_PATH = 'models/nb_classifier.pkl'
@st.cache_resource
def load_artifacts():
    # word_dict is a list of (word, count) pairs; keep only the words
    with open(WORD_DICT_PATH, 'rb') as f:
        word_dict = pickle.load(f)
    words = [w[0] for w in word_dict]
    with open(CLASSIFIER_PATH, 'rb') as f:
        clf = pickle.load(f)
    return words, clf
@st.cache_data
def text_to_features(text, words):
    # Build a 1 x vocabulary-size count vector for a single email
    row_words = text.split()
    return np.array([row_words.count(w) for w in words]).reshape(1, -1)
@st.cache_resource
def load_roberta():
    return pipeline("text-classification", model="roshana1s/spam-message-classifier")
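The effect of caching a loader can be illustrated without Streamlit. The sketch below uses `functools.lru_cache` as a rough stand-in for `@st.cache_resource`. It is only an analogy, since Streamlit's decorators additionally hash the function arguments and share the cached object across sessions and reruns.

```python
from functools import lru_cache

calls = []   # records every time the loader body actually runs

@lru_cache(maxsize=None)
def load_model():
    """Pretend to load an expensive model."""
    calls.append("loaded")
    return {"name": "toy-model"}

first = load_model()
second = load_model()                  # served from the cache, body is skipped
print(len(calls), first is second)     # 1 True
```

In the app, this is what keeps the pickled classifier and the RoBERTa pipeline from being reloaded on every button click, as Streamlit reruns the whole script on each interaction.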
2. Multi-Model Inference Dashboard
The UI provides a text area in the wide layout and, when the Predict button is clicked, runs both models on the user's text. The Naive Bayes and RoBERTa predictions, along with their respective probability scores, are rendered side by side in two layout columns:
st.title("Email Spam Classifier")
words, clf = load_artifacts()
roberta_model = load_roberta()
placeholder_email = "Congratulations! You won a free ticket, click here"
st.markdown("### Enter Email Text")
text = st.text_area("Email text", placeholder=placeholder_email, height=150)
if st.button("Predict"):
    if not text.strip():
        st.warning("Please enter email text to classify.")
    else:
        col1, col2 = st.columns(2)
        # 1. Naive Bayes Path
        X = text_to_features(text, words)
        pred = clf.predict(X)[0]
        label = "spam" if pred == 1 else "ham"
        col1.markdown("### Naive Bayes Prediction")
        col1.markdown(f"**Prediction:** {label}")
        if hasattr(clf, "predict_proba"):
            probs = clf.predict_proba(X)[0]
            col1.markdown(f"**Probabilities:** Ham: {probs[0]:.3f}, Spam: {probs[1]:.3f}")
        # 2. RoBERTa Path (truncate long emails to the 512-token limit, as in the evaluation)
        roberta_result = roberta_model(text, truncation=True, max_length=512)[0]
        col2.markdown("### RoBERTa Prediction")
        col2.markdown(f"**Prediction:** {roberta_result['label']}")
        # The pipeline reports only the top label's score; derive the other class from it
        label = roberta_result['label'].lower()
        if label == 'ham':
            ham_score = roberta_result['score']
            spam_score = 1 - roberta_result['score']
        else:
            ham_score = 1 - roberta_result['score']
            spam_score = roberta_result['score']
        col2.markdown(f"**Probabilities:** Ham: {ham_score:.3f}, Spam: {spam_score:.3f}")
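The score handling in the RoBERTa branch can be factored into a small pure function, which makes it testable outside the UI. In the sketch below, `ham_spam_probs` is a hypothetical helper. Like the app code, it assumes a two-class model whose labels read "ham" or "spam" in any letter case. A model that reports generic names such as `LABEL_0` would need an explicit mapping first.

```python
def ham_spam_probs(result):
    """Map one pipeline result dict to (ham_probability, spam_probability)."""
    score = result["score"]   # confidence of the predicted (top) label only
    if result["label"].lower() == "ham":
        return score, 1 - score
    return 1 - score, score

ham, spam = ham_spam_probs({"label": "Ham", "score": 0.7302})
print(f"Ham: {ham:.3f}, Spam: {spam:.3f}")  # Ham: 0.730, Spam: 0.270
```

Taking `1 - score` for the other class is only valid because there are exactly two classes and the pipeline returns the top label by default.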
Streamlit UI Preview
Docker Containerization and Deployment
To run the Streamlit app consistently across environments, the project is packaged into a Docker container.
1. Dockerfile Analysis
The Docker configuration file supports multi-variant builds (CPU and GPU) through build arguments. By default, it sets up a CPU variant, but it can download CUDA-compiled PyTorch wheels for GPU-accelerated environments:
FROM python:3.10-slim
ARG PYTORCH_VARIANT=
WORKDIR /app
COPY requirements${PYTORCH_VARIANT:+-$PYTORCH_VARIANT}.txt requirements.txt
RUN if [ -z "$PYTORCH_VARIANT" ]; then \
        pip install --no-cache-dir -r requirements.txt; \
    else \
        pip install --no-cache-dir -r requirements.txt && \
        pip install torch torchvision --index-url https://download.pytorch.org/whl/cu132; \
    fi
COPY models /app/models
COPY Load_Model.py /app
EXPOSE 8501
CMD ["streamlit", "run", "Load_Model.py", "--server.port=8501", "--server.address=0.0.0.0"]
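The COPY line relies on the `${VAR:+word}` substitution, which expands to `word` only when the variable is set and non-empty. The Python sketch below mimics that rule to show which requirements file each build would pick. The variant value "gpu" is an assumption for illustration, since the article does not state which values the repository uses.

```python
def requirements_file(variant: str = "") -> str:
    """Mimic requirements${PYTORCH_VARIANT:+-$PYTORCH_VARIANT}.txt."""
    # ${VAR:+word} yields word only if VAR is set and non-empty, else nothing
    suffix = f"-{variant}" if variant else ""
    return f"requirements{suffix}.txt"

print(requirements_file())        # requirements.txt
print(requirements_file("gpu"))   # requirements-gpu.txt
```

With Docker, the variant is supplied at build time through `--build-arg PYTORCH_VARIANT=<value>`. Leaving it out selects the default CPU path.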
2. Running Pre-built Images from Registry
Instead of building locally, you can pull pre-built images directly from the registry:
- Pull CPU Image:
docker pull ghcr.io/preyumkr/email-spam-classifier:latest
- Pull GPU Image (CUDA 13.2 support):
docker pull ghcr.io/preyumkr/email-spam-classifier:latest-gpu
Running CPU Container
To start the application on a CPU-only host:
docker run -d --name email-spam -p 8501:8501 ghcr.io/preyumkr/email-spam-classifier:latest
Running GPU Container (CUDA 13.2 support)
To leverage GPU acceleration for the transformer model, ensure the following prerequisites are met:
- NVIDIA GPU available on the host machine.
- NVIDIA Container Toolkit installed to support the --gpus flag. You can install it on Ubuntu/Debian using:
# Download GPG key and add stable repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Update apt and install toolkit
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Configure the NVIDIA runtime for Docker, then restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
- Verify runtime compatibility:
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
Once verified, run the GPU-enabled container using:
docker run -d --name email-spam-gpu -p 8501:8501 --gpus all ghcr.io/preyumkr/email-spam-classifier:latest-gpu
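After starting either container, a short script can confirm that the app is reachable on the published port. This is a minimal sketch using only the standard library. The `/_stcore/health` path is an assumption based on recent Streamlit releases and may differ in older versions, and `app_is_up` is a hypothetical helper.

```python
import urllib.request

def app_is_up(base_url: str = "http://localhost:8501", timeout: float = 3.0) -> bool:
    """Return True if the app answers the health probe with HTTP 200."""
    url = base_url.rstrip("/") + "/_stcore/health"   # assumed health endpoint
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses
        return False

if __name__ == "__main__":
    print("app reachable:", app_is_up())
```

The transformer model is downloaded and loaded when the app first runs, so a container may need some time after `docker run` before the first prediction succeeds.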
Project Repository and Resources
Visit my GitHub repository for the full source code and the pre-built Docker image at https://github.com/PreyumKr/Email_Spam_Classifier. The dataset is also included in the repository.

