How to Build a Spam Detector with ML and Python

Modern spam detection systems rely heavily on machine learning: given sufficient training data, ML models consistently outperform hand-written filtering rules at classification tasks like this one.

This tutorial shows you how to build a spam detector using supervised learning. Specifically, you will use Python to train a logistic regression model that classifies emails as spam or non-spam.

Prerequisites

You will work with NumPy, SciPy, scikit-learn, and Matplotlib:

import numpy as np
import scipy.io
import sklearn.metrics
import matplotlib.pyplot as plt

Download the spam dataset, which consists of 4,601 emails split into a training set and a test set. Each email is labeled as either 0 (non-spam) or 1 (spam) and comes with 57 features: frequency percentages of 48 words and 6 characters, plus 3 statistics on runs of consecutive capitalized letters:

features = np.array(
    [
        "word_freq_make",
        "word_freq_address",
        "word_freq_all",
        "word_freq_3d",
        "word_freq_our",
        "word_freq_over",
        "word_freq_remove",
        "word_freq_internet",
        "word_freq_order",
        "word_freq_mail",
        "word_freq_receive",
        "word_freq_will",
        "word_freq_people",
        "word_freq_report",
        "word_freq_addresses",
        "word_freq_free",
        "word_freq_business",
        "word_freq_email",
        "word_freq_you",
        "word_freq_credit",
        "word_freq_your",
        "word_freq_font",
        "word_freq_000",
        "word_freq_money",
        "word_freq_hp",
        "word_freq_hpl",
        "word_freq_george",
        "word_freq_650",
        "word_freq_lab",
        "word_freq_labs",
        "word_freq_telnet",
        "word_freq_857",
        "word_freq_data",
        "word_freq_415",
        "word_freq_85",
        "word_freq_technology",
        "word_freq_1999",
        "word_freq_parts",
        "word_freq_pm",
        "word_freq_direct",
        "word_freq_cs",
        "word_freq_meeting",
        "word_freq_original",
        "word_freq_project",
        "word_freq_re",
        "word_freq_edu",
        "word_freq_table",
        "word_freq_conference",
        "char_freq_;",
        "char_freq_(",
        "char_freq_[",
        "char_freq_!",
        "char_freq_$",
        "char_freq_#",
        "capital_run_length_average",
        "capital_run_length_longest",
        "capital_run_length_total",
    ]
)

Load the data

First, load the data into appropriate train/test variables:

data = scipy.io.loadmat("spamData.mat")
X = data["Xtrain"]      # training features
N = X.shape[0]          # number of training emails
D = X.shape[1]          # number of features
Xtest = data["Xtest"]   # test features
Ntest = Xtest.shape[0]  # number of test emails
y = data["ytrain"].squeeze().astype(int)     # training labels, 0 or 1
ytest = data["ytest"].squeeze().astype(int)  # test labels, 0 or 1
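
As a quick sanity check, print the sizes: the training and test emails should add up to the 4,601 emails in the dataset, each described by 57 features:

print(N, Ntest, N + Ntest)  # training size, test size, total (4601)
print(D)                    # number of features (57)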

Next, normalize the scale of each feature by computing z-scores. Note that the test set is standardized with the training set's mean and standard deviation, so no information from the test set leaks into preprocessing:

Xz = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
Xtestz = (Xtest - np.mean(X, axis=0)) / np.std(X, axis=0)
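
A quick check that the transformation worked: every standardized training column should now have mean 0 and standard deviation 1:

print(np.allclose(Xz.mean(axis=0), 0))  # True
print(np.allclose(Xz.std(axis=0), 1))   # True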

Define the logistic regression model

Define a numerically stable logsumexp helper, the log-sigmoid, and the log-likelihood of the logistic regression model:

def logsumexp(x):
    # compute log(sum(exp(x), axis=0)) without overflow by factoring out the max
    offset = np.max(x, axis=0)
    return offset + np.log(np.sum(np.exp(x - offset), axis=0))

def logsigma(x):
    # log of the sigmoid: log(1 / (1 + exp(-x))) = -logsumexp([0, -x])
    if not isinstance(x, np.ndarray):
        return -logsumexp(np.array([0, -x]))
    else:
        return -logsumexp(np.vstack((np.zeros(x.shape[0]), -x)))

def l(y, X, w):
    # log-likelihood: sum_i y_i * log sigma(x_i . w) + (1 - y_i) * log sigma(-x_i . w)
    return np.sum(y * logsigma(X.dot(w)) + (1 - y) * logsigma(-(X.dot(w))))

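
The logsumexp trick is what makes this numerically stable: computing the log-sigmoid naively underflows for large negative inputs. A small demonstration (expect runtime warnings from the naive version):

x = np.array([-1000.0, 0.0, 1000.0])
print(np.log(1 / (1 + np.exp(-x))))  # naive: [-inf, -0.69..., 0.] plus overflow warnings
print(logsigma(x))                   # stable: [-1000., -0.69..., 0.]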

Define the gradient of the log-likelihood:

def sigma(x):
    # the logistic (sigmoid) function; this form avoids the NaN that
    # exp(x) / (1 + exp(x)) produces when exp(x) overflows for large x
    return 1 / (1 + np.exp(-x))

def dl(y, X, w):
    # gradient of the log-likelihood: X^T (y - sigma(Xw))
    return (y - sigma(X.dot(w))).dot(X)
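
Before optimizing, it is worth verifying dl against a finite-difference approximation of l. A minimal check on small random data (the shapes and tolerance here are arbitrary choices):

rng = np.random.RandomState(1)
Xc = rng.randn(5, 3)       # small random dataset
yc = rng.randint(0, 2, 5)  # random 0/1 labels
wc = rng.randn(3)          # random weights
grad = dl(yc, Xc, wc)
for j in range(3):
    e = np.zeros(3)
    e[j] = 1e-6
    approx = (l(yc, Xc, wc + e) - l(yc, Xc, wc - e)) / 2e-6
    assert abs(grad[j] - approx) < 1e-4  # analytic and numeric gradients agree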

Determine the parameters using maximum likelihood estimation (MLE) and gradient descent (GD)

Here is a Python framework for implementing GD. It records the objective value after every epoch and adapts the step size with a simple heuristic: if the objective got worse, halve the step size; otherwise, grow it by 5%:

def optimize(obj_up, theta0, nepochs=50, eps0=0.01, verbose=True):
    f, update = obj_up
    theta = theta0
    values = np.zeros(nepochs + 1)
    eps = np.zeros(nepochs + 1)
    values[0] = f(theta0)
    eps[0] = eps0

    for epoch in range(nepochs):
        if verbose:
            print(
                "Epoch {:3d}: f={:10.3f}, eps={:10.9f}".format(
                    epoch, values[epoch], eps[epoch]
                )
            )
        theta = update(theta, eps[epoch])

        values[epoch + 1] = f(theta)
        if values[epoch] < values[epoch + 1]:
            eps[epoch + 1] = eps[epoch] / 2.0
        else:
            eps[epoch + 1] = eps[epoch] * 1.05

    if verbose:
        print("Result after {} epochs: f={}".format(nepochs, values[-1]))
    return theta, values, eps

def gd(y, X):
    def objective(w):
        return -(l(y, X, w))

    def update(w, eps):
        return w + eps * dl(y, X, w)

    return (objective, update)
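
Since optimize only needs an (objective, update) pair, you can smoke-test it on something tiny before touching the real data. A hypothetical one-dimensional quadratic, for example, should drive theta toward its minimizer at 0:

quad = (lambda t: t ** 2, lambda t, eps: t - eps * 2 * t)
t_opt, _, _ = optimize(quad, np.array(5.0), nepochs=50, verbose=False)
print(t_opt)  # close to 0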

You can now run GD to obtain optimized weights:

np.random.seed(0)
w0 = np.random.normal(size=D)
wz_gd, vz_gd, ez_gd = optimize(gd(y, Xz), w0, nepochs=100)
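
Plotting the objective values returned by optimize is a quick way to confirm convergence; the curve should decrease and flatten out:

plt.plot(vz_gd)
plt.xlabel("Epoch")
plt.ylabel("Negative log-likelihood")
plt.show()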

Predict

Finally, you can define a predictor that outputs spam-confidence values and a classifier that thresholds them:

def predict(Xtest, w):
    return sigma(Xtest.dot(w))

def classify(Xtest, w):
    threshold = 0.5 # default threshold; tuned below
    return (sigma(Xtest.dot(w)) > threshold).astype(int)

yhat = predict(Xtestz, wz_gd)
ypred = classify(Xtestz, wz_gd)
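
Before tuning anything, check how the classifier does on the test set with the default threshold; a quick evaluation using scikit-learn:

print(sklearn.metrics.accuracy_score(ytest, ypred))    # fraction of correct predictions
print(sklearn.metrics.confusion_matrix(ytest, ypred))  # rows: true class, columns: predicted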

Plot the precision-recall curve to find a better threshold value:

precision, recall, thresholds = sklearn.metrics.precision_recall_curve(ytest, yhat)
plt.plot(recall, precision)
# annotate the curve with threshold values at evenly spaced points
for x in np.linspace(0, 1, 10, endpoint=False):
    index = int(x * (precision.size - 1))
    plt.text(recall[index], precision[index], "{:3.2f}".format(thresholds[index]))
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
# from the plot, a threshold of 0.44 looks good
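
You can then re-classify with the threshold read off the plot and compare against the default; a short sketch (0.44 comes from the curve above and may differ for other splits):

ypred_tuned = (yhat > 0.44).astype(int)
print(sklearn.metrics.precision_score(ytest, ypred_tuned))
print(sklearn.metrics.recall_score(ytest, ypred_tuned))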

Have a look at the largest weights:

print(features[wz_gd > 2])  # features whose learned weight exceeds 2

Unsurprisingly, you will find that char_freq_$ and capital_run_length_longest carry outsized weight: spam emails frequently contain dollar signs and long runs of capital letters.
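
Instead of thresholding at an arbitrary weight, you can also rank all features by their learned weight; a minimal sketch listing the top five:

order = np.argsort(wz_gd)[::-1]  # feature indices, largest weight first
for i in order[:5]:
    print("{:30s} {:6.2f}".format(features[i], wz_gd[i]))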

Conclusion

In this tutorial, you learned how to build an email spam detector using machine learning and Python. If you want more practice, try finding another dataset and building a binary classifier using the framework introduced here.
