How to Build a Spam Detector with ML and Python

Modern spam detection systems rely heavily on machine learning: given sufficient training data, ML models consistently outperform hand-written filtering rules at classification tasks like this one.

This tutorial shows you how to build a spam detector using supervised learning. Specifically, you will use Python to train a logistic regression model that classifies emails as spam or non-spam.

Prerequisites

You will work with NumPy, SciPy, scikit-learn, and Matplotlib:

import numpy as np
import scipy.io
import sklearn.metrics
import matplotlib.pyplot as plt

Download the spam dataset, which consists of 4,601 emails split into a training set and a test set. Each email is labeled as either 0 (non-spam) or 1 (spam) and comes with 57 features: frequency percentages of 48 words and 6 characters, plus 3 statistics on runs of consecutive capitalized letters:

features = np.array(
    [
        "word_freq_make",
        "word_freq_address",
        "word_freq_all",
        "word_freq_3d",
        "word_freq_our",
        "word_freq_over",
        "word_freq_remove",
        "word_freq_internet",
        "word_freq_order",
        "word_freq_mail",
        "word_freq_receive",
        "word_freq_will",
        "word_freq_people",
        "word_freq_report",
        "word_freq_addresses",
        "word_freq_free",
        "word_freq_business",
        "word_freq_email",
        "word_freq_you",
        "word_freq_credit",
        "word_freq_your",
        "word_freq_font",
        "word_freq_000",
        "word_freq_money",
        "word_freq_hp",
        "word_freq_hpl",
        "word_freq_george",
        "word_freq_650",
        "word_freq_lab",
        "word_freq_labs",
        "word_freq_telnet",
        "word_freq_857",
        "word_freq_data",
        "word_freq_415",
        "word_freq_85",
        "word_freq_technology",
        "word_freq_1999",
        "word_freq_parts",
        "word_freq_pm",
        "word_freq_direct",
        "word_freq_cs",
        "word_freq_meeting",
        "word_freq_original",
        "word_freq_project",
        "word_freq_re",
        "word_freq_edu",
        "word_freq_table",
        "word_freq_conference",
        "char_freq_;",
        "char_freq_(",
        "char_freq_[",
        "char_freq_!",
        "char_freq_$",
        "char_freq_#",
        "capital_run_length_average",
        "capital_run_length_longest",
        "capital_run_length_total",
    ]
)

Load the data

First, load the data into appropriate train/test variables:

data = scipy.io.loadmat("spamData.mat")
X = data["Xtrain"]      # training features
N = X.shape[0]          # number of training emails
D = X.shape[1]          # number of features
Xtest = data["Xtest"]   # test features
Ntest = Xtest.shape[0]  # number of test emails
y = data["ytrain"].squeeze().astype(int)     # training labels, 0 or 1
ytest = data["ytest"].squeeze().astype(int)  # test labels, 0 or 1
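
As a quick sanity check, print the sizes: the training and test emails should add up to the 4,601 emails in the dataset, each described by 57 features:

print(N, Ntest, N + Ntest)  # training size, test size, total (4601)
print(D)                    # number of features (57)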

Next, normalize the scale of each feature by computing z-scores. Note that the test set is standardized with the training set's mean and standard deviation, so no information from the test set leaks into preprocessing:

Xz = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
Xtestz = (Xtest - np.mean(X, axis=0)) / np.std(X, axis=0)
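
A quick check that the transformation worked: every standardized training column should now have mean 0 and standard deviation 1:

print(np.allclose(Xz.mean(axis=0), 0))  # True
print(np.allclose(Xz.std(axis=0), 1))   # True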

Define the logistic regression model

Define a numerically stable logsumexp helper, the log-sigmoid, and the log-likelihood of the logistic regression model:

def logsumexp(x):
    # compute log(sum(exp(x), axis=0)) without overflow by factoring out the max
    offset = np.max(x, axis=0)
    return offset + np.log(np.sum(np.exp(x - offset), axis=0))

def logsigma(x):
    # log of the sigmoid: log(1 / (1 + exp(-x))) = -logsumexp([0, -x])
    if not isinstance(x, np.ndarray):
        return -logsumexp(np.array([0, -x]))
    else:
        return -logsumexp(np.vstack((np.zeros(x.shape[0]), -x)))

def l(y, X, w):
    # log-likelihood: sum_i y_i * log sigma(x_i . w) + (1 - y_i) * log sigma(-x_i . w)
    return np.sum(y * logsigma(X.dot(w)) + (1 - y) * logsigma(-(X.dot(w))))

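
The logsumexp trick is what makes this numerically stable: computing the log-sigmoid naively underflows for large negative inputs. A small demonstration (expect runtime warnings from the naive version):

x = np.array([-1000.0, 0.0, 1000.0])
print(np.log(1 / (1 + np.exp(-x))))  # naive: [-inf, -0.69..., 0.] plus overflow warnings
print(logsigma(x))                   # stable: [-1000., -0.69..., 0.]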

Define the gradient of the log-likelihood:

def sigma(x):
    # the logistic (sigmoid) function; this form avoids the NaN that
    # exp(x) / (1 + exp(x)) produces when exp(x) overflows for large x
    return 1 / (1 + np.exp(-x))

def dl(y, X, w):
    # gradient of the log-likelihood: X^T (y - sigma(Xw))
    return (y - sigma(X.dot(w))).dot(X)
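
Before optimizing, it is worth verifying dl against a finite-difference approximation of l. A minimal check on small random data (the shapes and tolerance here are arbitrary choices):

rng = np.random.RandomState(1)
Xc = rng.randn(5, 3)       # small random dataset
yc = rng.randint(0, 2, 5)  # random 0/1 labels
wc = rng.randn(3)          # random weights
grad = dl(yc, Xc, wc)
for j in range(3):
    e = np.zeros(3)
    e[j] = 1e-6
    approx = (l(yc, Xc, wc + e) - l(yc, Xc, wc - e)) / 2e-6
    assert abs(grad[j] - approx) < 1e-4  # analytic and numeric gradients agree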

Determine the parameters using maximum likelihood estimation (MLE) and gradient descent (GD)

Here is a Python framework for implementing GD. It records the objective value after every epoch and adapts the step size with a simple heuristic: if the objective got worse, halve the step size; otherwise, grow it by 5%:

def optimize(obj_up, theta0, nepochs=50, eps0=0.01, verbose=True):
    f, update = obj_up
    theta = theta0
    values = np.zeros(nepochs + 1)
    eps = np.zeros(nepochs + 1)
    values[0] = f(theta0)
    eps[0] = eps0

    for epoch in range(nepochs):
        if verbose:
            print(
                "Epoch {:3d}: f={:10.3f}, eps={:10.9f}".format(
                    epoch, values[epoch], eps[epoch]
                )
            )
        theta = update(theta, eps[epoch])

        values[epoch + 1] = f(theta)
        if values[epoch] < values[epoch + 1]:
            eps[epoch + 1] = eps[epoch] / 2.0
        else:
            eps[epoch + 1] = eps[epoch] * 1.05

    if verbose:
        print("Result after {} epochs: f={}".format(nepochs, values[-1]))
    return theta, values, eps

def gd(y, X):
    def objective(w):
        return -(l(y, X, w))

    def update(w, eps):
        return w + eps * dl(y, X, w)

    return (objective, update)
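
Since optimize only needs an (objective, update) pair, you can smoke-test it on something tiny before touching the real data. A hypothetical one-dimensional quadratic, for example, should drive theta toward its minimizer at 0:

quad = (lambda t: t ** 2, lambda t, eps: t - eps * 2 * t)
t_opt, _, _ = optimize(quad, np.array(5.0), nepochs=50, verbose=False)
print(t_opt)  # close to 0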

You can now run GD to obtain optimized weights:

np.random.seed(0)
w0 = np.random.normal(size=D)
wz_gd, vz_gd, ez_gd = optimize(gd(y, Xz), w0, nepochs=100)
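
Plotting the objective values returned by optimize is a quick way to confirm convergence; the curve should decrease and flatten out:

plt.plot(vz_gd)
plt.xlabel("Epoch")
plt.ylabel("Negative log-likelihood")
plt.show()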

Predict

Finally, you can define a predictor that outputs spam-confidence values and a classifier that thresholds them:

def predict(Xtest, w):
    return sigma(Xtest.dot(w))

def classify(Xtest, w):
    threshold = 0.5 # default threshold; tuned below
    return (sigma(Xtest.dot(w)) > threshold).astype(int)

yhat = predict(Xtestz, wz_gd)
ypred = classify(Xtestz, wz_gd)
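
Before tuning anything, check how the classifier does on the test set with the default threshold; a quick evaluation using scikit-learn:

print(sklearn.metrics.accuracy_score(ytest, ypred))    # fraction of correct predictions
print(sklearn.metrics.confusion_matrix(ytest, ypred))  # rows: true class, columns: predicted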

Plot the precision-recall curve to find a better threshold value:

precision, recall, thresholds = sklearn.metrics.precision_recall_curve(ytest, yhat)
plt.plot(recall, precision)
# annotate the curve with threshold values at evenly spaced points
for x in np.linspace(0, 1, 10, endpoint=False):
    index = int(x * (precision.size - 1))
    plt.text(recall[index], precision[index], "{:3.2f}".format(thresholds[index]))
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
# from the plot, a threshold of 0.44 looks good
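
You can then re-classify with the threshold read off the plot and compare against the default; a short sketch (0.44 comes from the curve above and may differ for other splits):

ypred_tuned = (yhat > 0.44).astype(int)
print(sklearn.metrics.precision_score(ytest, ypred_tuned))
print(sklearn.metrics.recall_score(ytest, ypred_tuned))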

Have a look at the largest weights:

print(features[wz_gd > 2])  # features whose learned weight exceeds 2

Unsurprisingly, you will find that char_freq_$ and capital_run_length_longest carry outsized weight: spam emails frequently contain dollar signs and long runs of capital letters.
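
Instead of thresholding at an arbitrary weight, you can also rank all features by their learned weight; a minimal sketch listing the top five:

order = np.argsort(wz_gd)[::-1]  # feature indices, largest weight first
for i in order[:5]:
    print("{:30s} {:6.2f}".format(features[i], wz_gd[i]))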

Conclusion

In this tutorial, you learned how to build an email spam detector using machine learning and Python. If you want more practice, try finding another dataset and building a binary classifier using the framework introduced here.
