Princess Mae Sanchez

Posted on Oct 31

🚫📩 Build a Spam Message Classifier with Python (Step-by-Step for Beginners)

#machinelearning #kaggle

Hey there! 👋

I recently finished Kaggle’s Intro to Machine Learning course, and to put my new skills into practice, I built a Spam Message Classifier — an AI that can tell whether a text message is spam or not.

If you’ve ever wondered how Gmail filters spam emails automatically, this post will show you the basic idea behind it (and how you can build a simple version yourself)!

Don’t worry if you’re starting from zero. I’ll explain everything line by line — no background knowledge required. 🧠✨

🎯 What You’ll Learn

How to train a simple AI model to detect spam messages
How to clean and prepare a dataset
How to evaluate your model’s performance
Why learning this is useful and where you can go next

Step 1: Import the Libraries

Let’s start by importing the tools we’ll need.

import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
print("✅ Libraries imported successfully!")

What’s Happening Here?

pandas – lets us handle datasets easily (like working with Excel inside Python).
numpy – for math and number operations.
sklearn (scikit-learn) – our main machine learning library.

train_test_split() – divides data into training and testing parts.
CountVectorizer() – converts words into numbers (AI can’t read text directly).
MultinomialNB() – our machine learning model (Naive Bayes classifier).
accuracy_score() & classification_report() – used to check how good our AI is.

Step 2: Load the Dataset

We’ll use a public dataset from Kaggle called SMS Spam Collection.
This dataset contains thousands of real text messages labeled as either spam or ham (ham = not spam).

df = pd.read_csv('/kaggle/input/d/uciml/sms-spam-collection-dataset/spam.csv', encoding='latin-1')
df = df[['v1', 'v2']]
df.columns = ['label', 'message']
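Since cleaning is part of preparing any dataset, here is a minimal sketch of two quick checks you could run right after loading: missing values and duplicate rows. It uses a tiny made-up DataFrame (not the real Kaggle file) so it runs anywhere, and dropping duplicates is an optional extra of mine, not a step from the original workflow.

```python
import pandas as pd

# Toy stand-in for the real file; same column names as the tutorial's df.
df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "spam", "ham"],
    "message": ["See you at 5", "WIN a prize now", "See you at 5", None, "Call me later"],
})

# How many values are missing in each column?
print("Missing values per column:")
print(df.isna().sum())

# How many rows are exact copies of an earlier row?
print("Duplicate rows:", df.duplicated().sum())

# Drop rows with no message text, then drop exact duplicates.
clean_df = df.dropna(subset=["message"]).drop_duplicates().reset_index(drop=True)
print(f"Rows before: {len(df)}, after: {len(clean_df)}")
```

On the real dataset you would run the same calls on the `df` loaded above and decide for yourself whether removing duplicates makes sense.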

Step 3: Understand the Data

Before training any model, we must understand what our data looks like.

print(f"Total messages: {len(df)}")
print(df.head())
print(df['label'].value_counts())

This helps us know how many spam and ham messages exist — super important for checking balance in our dataset.
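To make "checking balance" concrete, here is a small sketch that turns label counts into percentages. The labels below are made up for illustration (8 ham, 2 spam); on the real data you would call the same method on `df['label']`.

```python
import pandas as pd

# Made-up labels: 8 ham and 2 spam.
labels = pd.Series(["ham"] * 8 + ["spam"] * 2, name="label")

# normalize=True returns fractions instead of raw counts.
share = labels.value_counts(normalize=True)
print((share * 100).round(1))

# A "model" that always answers with the most common label
# would already reach this accuracy, so it is a useful baseline.
baseline_accuracy = share.max()
print(f"Always-guess-the-majority baseline: {baseline_accuracy:.0%}")
```

If one class dominates, keep that baseline in mind when you read the accuracy score later.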

Step 4: Convert Labels to Numbers

AI works with numbers, not words. So we’ll map:

spam → 1
ham → 0

df['label'] = df['label'].map({'spam': 1, 'ham': 0})
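One thing worth knowing about `.map()`: any label that is not a key in the dictionary silently becomes `NaN`. The sketch below uses a made-up Series with one messy label to show how you might check for that. The messy value and the clean-up step are my own illustration, not something the real dataset is known to need.

```python
import pandas as pd

# Made-up labels; "Spam " has a capital letter and a trailing space.
raw = pd.Series(["ham", "spam", "Spam ", "ham"])

mapped = raw.map({"spam": 1, "ham": 0})
n_unmapped = mapped.isna().sum()
print("Labels that did not map:", n_unmapped)

# Normalizing the text first makes the mapping more forgiving.
normalized = raw.str.strip().str.lower().map({"spam": 1, "ham": 0})
print(normalized.tolist())
```

Running `df['label'].isna().sum()` after the mapping step is a cheap way to confirm that every row got a number.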

Step 5: Split the Data

We need to test the model on unseen data to see whether it really learned patterns rather than just memorizing the training messages.

X = df['message']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

80% of the data is for training.
20% is for testing.
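When one class is much rarer than the other, a purely random split can leave the training and test parts with different spam ratios. As an optional variation (not part of the split above), `train_test_split` accepts a `stratify` argument that keeps the ratio the same in both parts. The sketch uses 50 placeholder messages, 10 of them labeled spam.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 50 messages, 10 of them spam (20%).
messages = [f"message {i}" for i in range(50)]
labels = [1] * 10 + [0] * 40

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, random_state=42, stratify=labels
)

# Both parts keep the 20% spam ratio.
print(f"Train: {sum(y_train)} spam out of {len(y_train)}")
print(f"Test:  {sum(y_test)} spam out of {len(y_test)}")
```

In the tutorial's code, that would mean adding `stratify=y` to the existing call.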

Step 6: Convert Text to Numbers

AI can’t “read” words. We need to represent each message as a vector of numbers using CountVectorizer.

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

Step 7: Train the Model

Now comes the fun part — training the machine learning model!

model = MultinomialNB()
model.fit(X_train_vec, y_train)
print("✅ Model trained successfully!")

This teaches the model which words often appear in spam messages (like “free”, “win”, “click”) and which appear in normal ones.

Step 8: Test the Model

Let’s see how well it performs.

y_pred = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)
print(f"🎯 Model Accuracy: {accuracy * 100:.2f}%")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))

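If your accuracy is around 95–98%, that’s great! 🎉
Don’t stop at accuracy, though: most messages in this dataset are ham, so also check the precision and recall on the Spam row of the report to see how much spam is actually caught.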
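To see why accuracy alone can flatter a model when most messages are ham, here is a small sketch with made-up labels. It scores a "lazy" predictor that calls everything ham: the accuracy looks decent, yet it catches zero spam.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up ground truth: 17 ham (0) and 3 spam (1).
y_true = [0] * 17 + [1] * 3
# A lazy "model" that predicts ham for every message.
y_lazy = [0] * 20

lazy_accuracy = accuracy_score(y_true, y_lazy)
lazy_recall = recall_score(y_true, y_lazy, pos_label=1)
cm = confusion_matrix(y_true, y_lazy, labels=[0, 1])

print(f"Accuracy: {lazy_accuracy:.0%}")     # 85%
print(f"Spam recall: {lazy_recall:.0%}")    # 0%
print(cm)  # rows = actual (ham, spam), columns = predicted (ham, spam)
```

That is why the Spam row of `classification_report` deserves a look alongside the headline accuracy.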

Step 9: Test It with New Messages

Let’s make our own mini spam detector function:

def predict_spam(message):
    message_vec = vectorizer.transform([message])
    prediction = model.predict(message_vec)[0]
    probability = model.predict_proba(message_vec)[0]

    result = "🚫 SPAM" if prediction == 1 else "✅ HAM (Not Spam)"
    # Labels are 0 and 1, so the predicted label doubles as the column index
    confidence = probability[prediction] * 100

    print(f"Message: '{message}'")
    print(f"Prediction: {result}")
    print(f"Confidence: {confidence:.1f}%\n")

Try it out! 👇

test_messages = [
    "Congratulations! You've won a free iPhone. Click here now!",
    "Hey, are we still meeting for lunch tomorrow?",
    "URGENT: Your account will be closed. Verify now!",
    "Can you send me the project report by Friday?"
]

for msg in test_messages:
    predict_spam(msg)

Output:

You’ll see a prediction and a confidence level for each message. Production filters like Gmail’s are far more sophisticated, but the core idea of scoring a message and flagging it is similar.
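If you would rather get a number back than a printed line, here is a variation on the function above. It returns the spam probability and looks up the right column through `classes_` instead of relying on the label order. The function name `spam_probability` and the four training messages are made up for this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: 1 = spam, 0 = ham.
train_msgs = [
    "win a free prize now",
    "free cash click here",
    "lunch tomorrow at noon",
    "send me the report",
]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(train_msgs), train_labels)

def spam_probability(message):
    """Return the model's estimated probability that `message` is spam."""
    proba = nb.predict_proba(vec.transform([message]))[0]
    spam_col = list(nb.classes_).index(1)  # column that belongs to label 1
    return float(proba[spam_col])

p_spam = spam_probability("free prize click now")
p_ham = spam_probability("lunch report tomorrow")
print(f"{p_spam:.2f} vs {p_ham:.2f}")
```

Returning a value makes the function easier to reuse, for example to apply your own threshold instead of the default 50%.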

Purpose: Why Learn This?

Understanding how to build a spam classifier is your first step into practical AI.

Here’s why it matters:

It teaches data preprocessing (cleaning, labeling, transforming text).
You learn how machine learning models actually learn from patterns.
It’s the foundation of Natural Language Processing (NLP) — the same technology behind chatbots, Google Translate, and Siri!
You can now deploy this model in a small web app using Flask, so anyone can type a message and check if it’s spam.
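Before a web app can use the model, the vectorizer and classifier need to be saved together. Here is one possible sketch of that step, using scikit-learn's `make_pipeline` and `joblib`. The file name `spam_pipeline.joblib`, the temporary folder, and the toy training data are assumptions for illustration. A Flask route could then load the file once and call `predict` on incoming text.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data: 1 = spam, 0 = ham.
train_msgs = [
    "win a free prize now",
    "free cash click here",
    "lunch tomorrow at noon",
    "send me the report",
]
train_labels = [1, 1, 0, 0]

# The pipeline bundles vectorizing and classifying into one object.
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(train_msgs, train_labels)

# Hypothetical file name, written to a temporary folder for this demo.
model_path = os.path.join(tempfile.mkdtemp(), "spam_pipeline.joblib")
joblib.dump(spam_pipeline, model_path)

# Later (e.g. when the web app starts), load it and predict on raw text.
loaded = joblib.load(model_path)
reloaded_prediction = int(loaded.predict(["free prize click now"])[0])
print("Prediction from reloaded model:", reloaded_prediction)
```

Bundling both steps means the app never has to remember to vectorize the text first. Only load joblib files you created yourself, since loading runs code stored in the file.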

Final Thoughts

If you made it this far, congrats! You didn’t just copy code — you built your first working AI model.
Keep experimenting, keep learning, and soon you’ll be deploying your own intelligent apps to the world. 🌍

DEV Community