Preyum Kumar

Posted on Jun 3

Email Spam Classifier with Streamlit and Docker

#ai #docker #streamlit #nlp

This guide details an end-to-end Machine Learning pipeline for email spam classification, covering text preprocessing, comparative evaluations between Naive Bayes and fine-tuned RoBERTa models, interactive visualization with Streamlit, and deployment using Docker.

Index

Introduction and Overview
Dataset Ingestion and Preprocessing
Vocabulary Building and Filtering
Feature Extraction and Engineering
Model Training and Serialization
Comparative Analysis: Naive Bayes vs. Fine-Tuned RoBERTa Models
Interactive Streamlit Interface
Docker Containerization and Deployment
Project Repository and Resources

Introduction and Overview

Email spam detection is a classic text classification problem in Machine Learning. The objective is to automatically classify incoming emails as either Spam (unsolicited bulk messages) or Ham (legitimate personal or professional messages).

This project trains a custom Multinomial Naive Bayes classifier, a classical bag-of-words method well suited to word-frequency feature matrices. It also integrates and evaluates two pre-trained Transformer-based RoBERTa models from Hugging Face (specifically dima806/email-spam-detection-roberta and roshana1s/spam-message-classifier). The comparison highlights the difference between processing raw text contextually and counting word frequencies.

The architecture of the project is divided into three main components:

Model Training Pipeline: A training script that cleans the dataset, constructs a vocabulary of the most common words, generates bag-of-words features, trains the Naive Bayes classifier, and serializes the model artifacts.
Interactive User Interface: A Streamlit web application that loads the saved Naive Bayes artifacts along with the fine-tuned RoBERTa model to provide a user-friendly side-by-side inference interface.
Deployment: A Dockerfile that packages the application and its dependencies into a container image that can run on CPU-only or GPU-accelerated hosts.

System Architecture Diagram

Dataset Ingestion and Preprocessing

The project starts by consolidating multiple raw datasets into a single clean dataset using a dedicated preprocessing script. This ensures uniform text formatting and standardized labeling across different data sources.

1. Processing Raw Datasets

Three different raw spam and ham email collections are loaded and cleaned:

For the first two datasets, we extract the label and text columns, replace multiple spaces with a single space, strip leading/trailing whitespace, lowercase the text, and drop null values.
For the third dataset, we additionally strip a leading Subject: prefix (matched case-insensitively) from the text and convert the numeric labels to strings (0 becomes ham; any other value becomes spam).

import pandas as pd

# Load and clean the first dataset
with open('Dataset/spam.csv', 'r', encoding='utf-8', errors='ignore') as f:
    df1 = pd.read_csv(f, usecols=[0, 1], names=['label', 'text'], header=0)
df1['text'] = df1['text'].str.replace(r'\s+', ' ', regex=True).str.strip().str.lower()
df1 = df1[['text', 'label']].dropna()

# Load and clean the second dataset
with open('Dataset/spam1.csv', 'r', encoding='utf-8', errors='ignore') as f:
    df2 = pd.read_csv(f, usecols=[0, 1], names=['label', 'text'], header=0)
df2['text'] = df2['text'].str.replace(r'\s+', ' ', regex=True).str.strip().str.lower()
df2 = df2[['text', 'label']].dropna()

# Load and clean the third dataset (specifically stripping the 'Subject:' prefix)
with open('Dataset/emails.csv', 'r', encoding='utf-8', errors='ignore') as f:
    df3 = pd.read_csv(f, usecols=[0, 1], names=['text', 'label'], header=0)
df3['text'] = (df3['text']
               .str.replace(r'(?i)^\s*subject[:\s]+', '', regex=True)
               .str.replace(r'\s+', ' ', regex=True)
               .str.strip()
               .str.lower())
df3 = df3.dropna()
df3["label"] = df3["label"].apply(lambda x: 'ham' if x == 0 else 'spam')
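The cleaning steps above can also be written as one plain-Python helper, which makes them easy to unit test and to reuse at inference time. This is a minimal sketch rather than the project's code: the `normalize_text` name and the `strip_subject` flag are hypothetical, and the regular expressions are copied from the pandas calls above.

```python
import re

# Patterns copied from the pandas cleaning calls above.
SUBJECT_PREFIX = re.compile(r'(?i)^\s*subject[:\s]+')
WHITESPACE_RUN = re.compile(r'\s+')


def normalize_text(text: str, strip_subject: bool = False) -> str:
    """Collapse whitespace, trim, and lowercase one email body.

    `normalize_text` and `strip_subject` are hypothetical names used for
    illustration; they do not appear in the original project.
    """
    if strip_subject:
        # Remove a leading "Subject:" marker (third dataset only).
        text = SUBJECT_PREFIX.sub('', text, count=1)
    text = WHITESPACE_RUN.sub(' ', text)
    return text.strip().lower()


print(normalize_text("Subject:  Win a   FREE prize\n now ", strip_subject=True))
```

Note that the prefix pattern accepts whitespace in place of the colon, so a message that simply begins with the word "subject" followed by a space would also lose that word.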

2. Combining the Datasets

Once cleaned, we concatenate all three processed datasets and save the output as a unified CSV file:

combined_df = pd.concat([df1, df2, df3], ignore_index=True)
combined_df.to_csv('Dataset/combined_spam_new.csv', index=False)

Vocabulary Building and Filtering

We load the combined dataset, clean the raw texts, and build a dictionary of words that will serve as features for our machine learning model.

1. Ingestion and Cleaning

We load the dataset, drop missing rows, and tokenize the texts by splitting them on spaces. To clean the vocabulary, we retain only alphabetic words:

# Load the combined file written in the previous step
data = pd.read_csv("Dataset/combined_spam_new.csv")
data_clean = data.dropna()

words = []
for row in data_clean['text']:
    words += row.split(" ")

for i in range(len(words)):
    if not words[i].isalpha():
        words[i] = ""

2. Frequency Filtering

Using a Counter object, we count word occurrences, remove empty token placeholders, and extract the 3,000 most common words. This vocabulary acts as our feature set:

from collections import Counter

word_dict = Counter(words)
del word_dict['']
word_dict = word_dict.most_common(3000)
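To make the filtering step concrete, here is a small self-contained sketch of the same idea on a toy corpus. The `build_vocabulary` helper and the sample messages are made up for illustration, and the vocabulary size is 3 here rather than the 3,000 used in the training script.

```python
from collections import Counter


def build_vocabulary(texts, size):
    """Return the `size` most common alphabetic tokens as (word, count) pairs.

    Hypothetical helper mirroring the steps above: split on spaces, keep
    only tokens for which str.isalpha() is true, then rank by frequency.
    """
    counts = Counter(
        token
        for text in texts
        for token in text.split(" ")
        if token.isalpha()
    )
    return counts.most_common(size)


toy_corpus = [
    "free prize click now",
    "free entry win prize 2day",
    "are we meeting now",
]
print(build_vocabulary(toy_corpus, 3))
```

One consequence of the `isalpha()` filter is that any token with attached digits or punctuation, such as `2day` or `prize!`, never enters the vocabulary.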

Feature Extraction and Engineering

With our 3,000-word vocabulary established, we map each email text to a numerical vector using a Bag-of-Words (BoW) representation.

1. Building the Feature Matrix

For each email, we create a count vector where each element represents the number of times a word from our 3,000-word vocabulary appears in the email.

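import numpy as np

feature_matrix = []
labels = []

for text, label in data_clean[['text', 'label']].values:
    # Resolve the label first so features and labels always stay aligned
    if 'spam' in label:
        target = 1
    elif 'ham' in label:
        target = 0
    else:
        continue  # skip rows with an unrecognized label

    row_words = text.split(" ")
    # One count per vocabulary entry; word_dict holds (word, count) pairs
    data_count = [row_words.count(word) for word, _ in word_dict]

    feature_matrix.append(data_count)
    labels.append(target)

feature_matrix = np.array(feature_matrix)
labels = np.array(labels)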
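The loop above calls `row_words.count` once per vocabulary word, so each email costs roughly 3,000 passes over its tokens. A possible alternative, sketched here with a hypothetical `vectorize` helper, counts the tokens once and then looks up each vocabulary word. It should produce the same vector for the same input.

```python
from collections import Counter


def vectorize(text, vocabulary):
    """Map one text to a list of counts, one entry per vocabulary word.

    `vectorize` is a hypothetical helper; `vocabulary` is a plain list of
    words, i.e. the first element of each (word, count) pair in word_dict.
    """
    token_counts = Counter(text.split(" "))
    # Counter returns 0 for missing keys, so unseen words need no special case.
    return [token_counts[word] for word in vocabulary]


vocabulary = ["free", "prize", "now"]
print(vectorize("free free prize today", vocabulary))
```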

Model Training and Serialization

Once the feature matrix and label arrays are generated, we split the data into training and test sets, train a Naive Bayes classifier, and serialize the trained model.

1. Splitting and Training

We use train_test_split from scikit-learn to split the dataset, reserving 20% of the data for testing. We then fit a MultinomialNB model.

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    feature_matrix, labels, test_size=0.2, random_state=9
)

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")
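For readers who want to see what `MultinomialNB` is estimating, the sketch below reimplements the scoring rule in plain Python on toy count vectors: a log prior per class plus count-weighted log likelihoods with add-one (Laplace) smoothing, which corresponds to scikit-learn's default `alpha=1.0`. The function names and the toy data are illustrative assumptions, and this is not a substitute for the library implementation.

```python
import math


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count vectors."""
    n_features = len(X[0])
    model = {}
    for cls in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == cls]
        # Total count of each vocabulary word within this class.
        word_totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(word_totals) + alpha * n_features
        model[cls] = {
            "log_prior": math.log(len(rows) / len(X)),
            "log_likelihood": [math.log((c + alpha) / denom) for c in word_totals],
        }
    return model


def predict(model, x):
    """Return the class with the highest joint log score for count vector x."""
    def score(cls):
        params = model[cls]
        return params["log_prior"] + sum(
            count * ll for count, ll in zip(x, params["log_likelihood"])
        )
    return max(model, key=score)


# Toy data: columns are counts of ["free", "prize", "meeting"]; 1 = spam, 0 = ham.
X = [[2, 1, 0], [1, 2, 0], [0, 0, 2], [0, 1, 1]]
y = [1, 1, 0, 0]
model = fit_multinomial_nb(X, y)
print(predict(model, [1, 1, 0]), predict(model, [0, 0, 1]))
```

Because each word contributes independently to the score, a message made up of words that happen to be frequent in spam can be flagged regardless of its overall meaning.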

2. Serialization

The vocabulary (stored in word_dict, which after most_common() is a list of (word, count) pairs rather than a dictionary) and the trained classifier are saved using Python's pickle module so that they can be loaded by the Streamlit application.

import os
import pickle

os.makedirs('models', exist_ok=True)

with open('models/word_dict.pkl', 'wb') as f:
    pickle.dump(word_dict, f, pickle.HIGHEST_PROTOCOL)

with open('models/nb_classifier.pkl', 'wb') as f:
    pickle.dump(classifier, f, protocol=pickle.HIGHEST_PROTOCOL)
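Before wiring the artifacts into the app, it can be worth confirming that they survive a save and load cycle. The sketch below does this for a stand-in vocabulary list in a temporary directory; the `save_and_reload` helper and the toy data are hypothetical.

```python
import os
import pickle
import tempfile


def save_and_reload(obj, path):
    """Pickle `obj` to `path`, then read it back and return the copy."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for word_dict: a list of (word, count) pairs, as produced by
# Counter.most_common().
toy_word_dict = [("free", 12), ("prize", 7), ("now", 5)]

with tempfile.TemporaryDirectory() as tmp_dir:
    restored = save_and_reload(toy_word_dict, os.path.join(tmp_dir, "word_dict.pkl"))

print(restored == toy_word_dict)
```

Keep in mind that `pickle.load` can execute arbitrary code, so only load pickle files you created or otherwise trust. scikit-learn also advises loading a pickled estimator with the same library version that produced it, which is one reason to pin versions in the requirements file.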

Comparative Analysis: Naive Bayes vs. Fine-Tuned RoBERTa Models

In addition to the custom-trained Naive Bayes model, the project evaluates two pre-trained deep-learning Transformer models using the Hugging Face pipeline interface to compare performance and explore model edge cases.

1. Fine-Tuned Transformer Models Evaluated

Model 1: dima806/email-spam-detection-roberta (Accuracy on test set: 79.58%)
Model 2: roshana1s/spam-message-classifier (Accuracy on test set: 84.45%)

Both models are loaded in CPU-only mode:

import os

# Hide any GPUs before the deep-learning libraries are imported, so inference runs on CPU
os.environ['CUDA_VISIBLE_DEVICES'] = ''

from transformers import pipeline

# Load fine-tuned models
spam_roberta_1 = pipeline("text-classification", model="dima806/email-spam-detection-roberta", device=-1)
spam_roberta_2 = pipeline("text-classification", model="roshana1s/spam-message-classifier", device=-1)

# Run inference on sample email text (truncating long sequences to 512 tokens)
text_sample = "Your email text goes here..."
result_1 = spam_roberta_1(text_sample, truncation=True, max_length=512)
result_2 = spam_roberta_2(text_sample, truncation=True, max_length=512)

print("Model 1 Result:", result_1)
print("Model 2 Result:", result_2)

2. Sincere Text Evaluation & Model Selection

While Model 2 has a higher overall test set accuracy (84.45% compared to 79.58%), evaluating these models on qualitative edge cases reveals significant behavior differences.

A test was conducted using a highly romantic, sincere declaration of love as input text:

"I love you! You are the best person in the world. I am so happy to have you in my life. You are my sunshine and my everything. I will always love you and be there for you. You are my soulmate and my best friend. I am so grateful to have you in my life. I love you more than words can express. You are the love of my life and I will always cherish you. I am so lucky to have you as my partner. I love you with all my heart and soul. You are the most amazing person I have ever met and I am so blessed to have you in my life."

Running this sample text through the classifiers yielded:

Naive Bayes (Bag-of-Words): Predicted as Spam
Model 1 (dima806): Predicted as Spam with 97.15% confidence
Model 2 (roshana1s): Predicted as Ham with 73.02% confidence

Why Model 2 is Preferred

The Naive Bayes model and the first RoBERTa model both flag this heartfelt message as spam. For Naive Bayes, a plausible explanation is that words such as "cherish", "partner", "love", and "best" appear more often in the spam portion of the training data. The cause is less clear for the RoBERTa model, which does not score words independently. In a real deployment, this behavior would filter out genuine personal messages or declarations of affection.

Model 2 is the only one of the three classifiers that labels the message as Ham. Because of this robustness on the edge case and its higher overall test accuracy, Model 2 is the preferred classifier and is the one used in the Streamlit app.
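A single example is only anecdotal, so one way to back up this choice would be to measure how often each classifier flags legitimate mail on a labeled sample. The sketch below is a hedged illustration: `evaluate` and the keyword-based `toy_classifier` are hypothetical stand-ins, and any callable that maps a text to `'spam'` or `'ham'` (for example a thin wrapper around the Naive Bayes model or a RoBERTa pipeline) could be passed instead.

```python
def evaluate(classify, samples):
    """Compute accuracy and ham false-positive rate for a text classifier.

    `classify` is any callable mapping a text to 'spam' or 'ham';
    `samples` is a list of (text, true_label) pairs.
    """
    correct = 0
    ham_total = 0
    ham_flagged = 0
    for text, true_label in samples:
        predicted = classify(text)
        correct += predicted == true_label
        if true_label == "ham":
            ham_total += 1
            ham_flagged += predicted == "spam"
    return {
        "accuracy": correct / len(samples),
        # Share of legitimate messages wrongly marked as spam.
        "ham_false_positive_rate": ham_flagged / ham_total if ham_total else 0.0,
    }


def toy_classifier(text):
    """Deliberately naive stand-in that flags any text containing 'love'."""
    return "spam" if "love" in text.lower() else "ham"


samples = [
    ("I love you with all my heart", "ham"),
    ("Meeting moved to 3pm", "ham"),
    ("Find love tonight, click here", "spam"),
    ("Claim your free prize", "spam"),
]
print(evaluate(toy_classifier, samples))
```

Reporting the ham false-positive rate alongside accuracy matters for spam filtering, since a model can score well overall while still discarding the personal messages users care most about.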

Interactive Streamlit Interface

The interactive Streamlit application wraps the Naive Bayes classifier and the fine-tuned RoBERTa model in a single web dashboard that shows their spam predictions side by side.

1. Model Loading and Feature Caching

To optimize performance, we utilize Streamlit's @st.cache_resource and @st.cache_data decorators. This ensures the Naive Bayes classifier, vocabulary, and RoBERTa pipeline are only loaded once, and the bag-of-words vectorization results are cached. We configure the app to use a wide screen layout using the st.set_page_config(layout="wide") command:

import pickle
import numpy as np
import streamlit as st
from transformers import pipeline

st.set_page_config(layout="wide")

WORD_DICT_PATH = 'models/word_dict.pkl'
CLASSIFIER_PATH = 'models/nb_classifier.pkl'

@st.cache_resource
def load_artifacts():
    with open(WORD_DICT_PATH, 'rb') as f:
        word_dict = pickle.load(f) 
    words = [w[0] for w in word_dict]  

    with open(CLASSIFIER_PATH, 'rb') as f:
        clf = pickle.load(f)
    return words, clf

@st.cache_data
def text_to_features(text, words):
    # Lowercase to match the training data, which was lowercased during preprocessing
    row_words = text.lower().split()
    return np.array([row_words.count(w) for w in words]).reshape(1, -1)

@st.cache_resource
def load_roberta():
    return pipeline("text-classification", model="roshana1s/spam-message-classifier")

2. Multi-Model Inference Dashboard

The UI provides a text area in a wide layout and, when the Predict button is clicked, runs both models on the submitted text one after the other. The Naive Bayes and RoBERTa predictions, along with their respective probability scores, are rendered side by side in two layout columns:

st.title("Email Spam Classifier")
words, clf = load_artifacts()
roberta_model = load_roberta()

placeholder_email = "Congratulations! You won a free ticket, click here"
st.markdown("### Enter Email Text")
text = st.text_area("Email text", placeholder=placeholder_email, height=150)

if st.button("Predict"):
    if not text.strip():
        st.warning("Please enter email text to classify.")
    else:
        col1, col2 = st.columns(2)

        # 1. Naive Bayes Path
        X = text_to_features(text, words)
        pred = clf.predict(X)[0]
        label = "spam" if pred == 1 else "ham"

        col1.markdown("### Naive Bayes Prediction")
        col1.markdown(f"**Prediction:** {label}")
        if hasattr(clf, "predict_proba"):
            probs = clf.predict_proba(X)[0]
            col1.markdown(f"**Probabilities:** Ham: {probs[0]:.3f}, Spam: {probs[1]:.3f}")

        # 2. RoBERTa Path
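        # Truncate long emails to the model's 512-token limit, as in the evaluation code
        roberta_result = roberta_model(text, truncation=True, max_length=512)[0]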
        col2.markdown("### RoBERTa Prediction")
        col2.markdown(f"**Prediction:** {roberta_result['label']}")

        label = roberta_result['label'].lower()
        if label == 'ham':
            ham_score = roberta_result['score']
            spam_score = 1 - roberta_result['score']
        else:
            ham_score = 1 - roberta_result['score']
            spam_score = roberta_result['score']
        col2.markdown(f"**Probabilities:** Ham: {ham_score:.3f}, Spam: {spam_score:.3f}")
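The label handling at the end of the code above can be pulled into a small function, which also makes the binary assumption explicit: the pipeline returns only the top label and its score, so the other class is taken as one minus that score. The `split_scores` name is hypothetical, and the sketch assumes the model emits labels that read `ham` or `spam` in some casing. A model that emits generic names such as `LABEL_0` would need a mapping based on its model card.

```python
def split_scores(result):
    """Turn one pipeline result dict into (ham_score, spam_score).

    `result` is expected to look like {'label': 'spam', 'score': 0.97}.
    Assumes a two-class model whose labels are 'ham' or 'spam' in any casing.
    """
    label = result["label"].lower()
    if label not in ("ham", "spam"):
        # Fail loudly instead of silently treating unknown labels as spam.
        raise ValueError(f"Unexpected label: {result['label']!r}")
    top = result["score"]
    return (top, 1 - top) if label == "ham" else (1 - top, top)


ham_score, spam_score = split_scores({"label": "Ham", "score": 0.7302})
print(f"Ham: {ham_score:.3f}, Spam: {spam_score:.3f}")
```

Unlike the inline version, this sketch raises an error on an unrecognized label rather than counting it as spam, which is a design choice and not the behavior of the original app.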

Streamlit UI Preview

Docker Containerization and Deployment

To run the Streamlit app consistently across environments, the project is packaged into a Docker container.

1. Dockerfile Analysis

The Dockerfile supports two build variants (CPU and GPU) through the PYTORCH_VARIANT build argument. By default it builds the CPU variant; when the argument is set, it copies the matching requirements file and additionally installs CUDA-enabled PyTorch wheels for GPU-accelerated environments:

FROM python:3.10-slim

ARG PYTORCH_VARIANT=

WORKDIR /app

COPY requirements${PYTORCH_VARIANT:+-$PYTORCH_VARIANT}.txt requirements.txt
# Install the shared dependencies, then add CUDA-enabled PyTorch wheels for non-default variants
RUN pip install --no-cache-dir -r requirements.txt && \
    if [ -n "$PYTORCH_VARIANT" ]; then \
      pip install --no-cache-dir torch torchvision --index-url https://download.pytorch.org/whl/cu132; \
    fi

COPY models /app/models
COPY Load_Model.py /app

EXPOSE 8501

CMD ["streamlit", "run", "Load_Model.py", "--server.port=8501", "--server.address=0.0.0.0"]
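The `COPY` line relies on the `${variable:+word}` substitution, which expands to `word` only when the variable is set and non-empty. The Python sketch below mirrors that rule to show which requirements file each build would copy. The `gpu` value is an assumption, since the article does not state which value the GPU build passes, and `requirements_file` is a hypothetical name.

```python
def requirements_file(pytorch_variant: str = "") -> str:
    """Mirror `requirements${PYTORCH_VARIANT:+-$PYTORCH_VARIANT}.txt`.

    The `-<variant>` suffix is added only when the variant is non-empty.
    """
    suffix = f"-{pytorch_variant}" if pytorch_variant else ""
    return f"requirements{suffix}.txt"


print(requirements_file())       # default (CPU) build
print(requirements_file("gpu"))  # "gpu" is an assumed example value
```

With the real Dockerfile, the variant would be supplied at build time through `docker build --build-arg PYTORCH_VARIANT=<value>`, and a requirements file with the matching suffix must exist in the build context.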

2. Running Pre-built Images from Registry

Instead of building locally, you can pull pre-built images directly from the registry:

Pull CPU Image:

  docker pull ghcr.io/preyumkr/email-spam-classifier:latest

Pull GPU Image (CUDA 13.2 support):

  docker pull ghcr.io/preyumkr/email-spam-classifier:latest-gpu

Running CPU Container

To start the application on a CPU-only host:

docker run -d --name email-spam -p 8501:8501 ghcr.io/preyumkr/email-spam-classifier:latest

Running GPU Container (CUDA 13.2 support)

To use GPU acceleration for the transformer model, make sure the following prerequisites are met:

NVIDIA GPU available on the host machine.
NVIDIA Container Toolkit installed to support the --gpus flag. You can install it on Ubuntu/Debian using:

   # Download GPG key and add stable repository
   curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
   curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
     sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
     sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

   # Update apt and install toolkit
   sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

   # Configure Docker to use the NVIDIA runtime, then restart the daemon
   sudo nvidia-ctk runtime configure --runtime=docker
   sudo systemctl restart docker

Verify runtime compatibility:

   docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

Once verified, run the GPU-enabled container using:

docker run -d --name email-spam-gpu -p 8501:8501 --gpus all ghcr.io/preyumkr/email-spam-classifier:latest-gpu

Project Repository and Resources

The full source code, the dataset, and the pre-built Docker images are available in my GitHub repository: https://github.com/PreyumKr/Email_Spam_Classifier.

DEV Community