Hey there! 👋
I recently finished Kaggle’s Intro to Machine Learning course, and to put my new skills into practice, I built a Spam Message Classifier — an AI that can tell whether a text message is spam or not.
If you’ve ever wondered how Gmail filters spam emails automatically, this post will help you understand how that works (and how you can make one yourself)!
Don’t worry if you’re starting from zero. I’ll explain everything line by line — no background knowledge required. 🧠✨
🎯 What You’ll Learn
- How to train a simple AI model to detect spam messages
- How to clean and prepare a dataset
- How to evaluate your model’s performance
- Why learning this is useful and where you can go next
Step 1: Import the Libraries
Let’s start by importing the tools we’ll need.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
print("✅ Libraries imported successfully!")
What’s Happening Here?
- pandas – lets us handle datasets easily (like working with Excel inside Python).
- numpy – for math and number operations.
- sklearn (scikit-learn) – our main machine learning library.
  - train_test_split() – divides data into training and testing parts.
  - CountVectorizer() – converts words into numbers (AI can’t read text directly).
  - MultinomialNB() – our machine learning model (a Naive Bayes classifier).
  - accuracy_score() & classification_report() – used to check how good our AI is.
Step 2: Load the Dataset
We’ll use a public dataset from Kaggle called SMS Spam Collection.
This dataset contains thousands of real text messages labeled as either spam or ham (ham = not spam).
df = pd.read_csv('/kaggle/input/d/uciml/sms-spam-collection-dataset/spam.csv', encoding='latin-1')
df = df[['v1', 'v2']]
df.columns = ['label', 'message']
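If you want to see what the column selection and renaming do without touching the real file, here is a minimal sketch on a tiny in-memory CSV. The rows are made up, and the extra `Unnamed: 2` column is an assumption about what the raw file looks like, so check `df.columns` on your own copy.

```python
import io

import pandas as pd

# Hypothetical stand-in for spam.csv: two made-up rows plus an empty extra column.
raw = io.StringIO(
    "v1,v2,Unnamed: 2\n"
    "ham,See you at lunch,\n"
    "spam,WIN a FREE prize now,\n"
)
sample = pd.read_csv(raw)
print(list(sample.columns))  # the raw column names, before any cleanup

# Keep only the label and text columns, then give them readable names.
sample = sample[["v1", "v2"]]
sample.columns = ["label", "message"]

# Count missing values left after the selection.
missing = int(sample.isna().sum().sum())
print(sample)
print(f"Missing values: {missing}")
```

Dropping the unused columns early keeps every later step focused on just `label` and `message`.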
Step 3: Understand the Data
Before training any model, we must understand what our data looks like.
print(f"Total messages: {len(df)}")
print(df.head())
print(df['label'].value_counts())
This helps us know how many spam and ham messages exist — super important for checking balance in our dataset.
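To make "balance" concrete, the sketch below turns label counts into percentages on a toy Series (the 8 ham / 2 spam split is invented for illustration, not the real dataset's numbers). It also shows the score a lazy model would get by always guessing the majority class.

```python
import pandas as pd

# Toy labels, made up for illustration: 8 ham and 2 spam.
labels = pd.Series(["ham"] * 8 + ["spam"] * 2, name="label")

# normalize=True returns shares instead of raw counts.
shares = labels.value_counts(normalize=True)
print(shares)

# Accuracy of a "model" that always predicts the most common class.
majority_baseline = float(shares.max())
print(f"Always guessing '{shares.idxmax()}' scores {majority_baseline:.0%}")
```

Any real model should beat that baseline, which is why it is worth knowing before you look at accuracy later.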
Step 4: Convert Labels to Numbers
AI works with numbers, not words. So we’ll map:
- spam → 1
- ham → 0
df['label'] = df['label'].map({'spam': 1, 'ham': 0})
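One thing worth knowing about `.map()` with a dictionary: any label that is not a key in the dictionary becomes `NaN`. Here is a small sketch with a deliberately messy label (the toy DataFrame is hypothetical) showing how you might spot and fix that.

```python
import pandas as pd

# Toy data; the last label is deliberately messy (capital letter, trailing space).
toy = pd.DataFrame({"label": ["ham", "spam", "Spam "]})
mapping = {"spam": 1, "ham": 0}

# Labels missing from the mapping turn into NaN.
mapped = toy["label"].map(mapping)
unmapped = int(mapped.isna().sum())
print(f"Labels that failed to map: {unmapped}")

# Normalising whitespace and case first makes the mapping complete.
cleaned = toy["label"].str.strip().str.lower().map(mapping)
print(cleaned.tolist())  # [0, 1, 1]
```

If `unmapped` is 0 on your data, the original one-liner is all you need.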
Step 5: Split the Data
We need to test the model on unseen data to see if it really learned, not memorized.
X = df['message']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
- 80% of the data is for training.
- 20% is for testing.
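An optional variation, not used in the code above: `train_test_split` accepts a `stratify` argument that keeps the spam/ham ratio the same in both parts. The sketch below uses invented toy data (50 messages, 20% spam) to show the effect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 50 messages, 10 of them spam (20%).
X_toy = pd.Series([f"message {i}" for i in range(50)])
y_toy = pd.Series([1] * 10 + [0] * 40)

# stratify=y_toy asks for the same class ratio in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Spam share in train: {y_tr.mean():.0%}")
print(f"Spam share in test:  {y_te.mean():.0%}")
```

This matters most when one class is rare, because a purely random split can leave the test set with very few spam examples.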
Step 6: Convert Text to Numbers
AI can’t “read” words. We need to represent each message as a vector of numbers using CountVectorizer.
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
Step 7: Train the Model
Now comes the fun part — training the machine learning model!
model = MultinomialNB()
model.fit(X_train_vec, y_train)
print("✅ Model trained successfully!")
This teaches the model which words often appear in spam messages (like “free”, “win”, “click”) and which appear in normal ones.
Step 8: Test the Model
Let’s see how well it performs.
y_pred = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)
print(f"🎯 Model Accuracy: {accuracy * 100:.2f}%")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))
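If your accuracy is around 95–98%, that’s great! 🎉
Because ham messages far outnumber spam in this dataset, accuracy alone can flatter the model, so also check the precision and recall in the Spam row of the report to see how many spam messages it actually catches.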
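Here is a small sketch of why it helps to look beyond accuracy. The true labels and predictions are invented (10 messages, one spam missed), and `confusion_matrix` and `recall_score` are extra scikit-learn metrics not imported in Step 1.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up results: 8 ham (0) and 2 spam (1); the model misses one spam.
y_true = [0] * 8 + [1] * 2
y_hat = [0] * 8 + [1, 0]

acc = accuracy_score(y_true, y_hat)
spam_recall = recall_score(y_true, y_hat, pos_label=1)

# For binary labels, ravel() gives: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

print(f"Accuracy:    {acc:.0%}")          # 90%
print(f"Spam recall: {spam_recall:.0%}")  # 50%
print(f"Spam caught: {tp}, spam missed: {fn}, ham flagged by mistake: {fp}")
```

The accuracy looks strong, yet half of the spam slipped through, which is exactly the kind of gap the classification report helps you notice.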
Step 9: Test It with New Messages
Let’s make our own mini spam detector function:
def predict_spam(message):
    # Turn the raw text into the same word-count vector used in training
    message_vec = vectorizer.transform([message])
    prediction = model.predict(message_vec)[0]
    probability = model.predict_proba(message_vec)[0]
    result = "🚫 SPAM" if prediction == 1 else "✅ HAM (Not Spam)"
    # Probability of the predicted class, as a percentage
    confidence = probability[prediction] * 100
    print(f"Message: '{message}'")
    print(f"Prediction: {result}")
    print(f"Confidence: {confidence:.1f}%\n")
Try it out! 👇
test_messages = [
    "Congratulations! You've won a free iPhone. Click here now!",
    "Hey, are we still meeting for lunch tomorrow?",
    "URGENT: Your account will be closed. Verify now!",
    "Can you send me the project report by Friday?"
]
for msg in test_messages:
    predict_spam(msg)
Output:
You’ll see each message printed with its prediction and a confidence percentage, the same basic idea that email spam filters like Gmail’s build on.
Purpose: Why Learn This?
Understanding how to build a spam classifier is your first step into practical AI.
Here’s why it matters:
- It teaches data preprocessing (cleaning, labeling, transforming text).
- You learn how machine learning models actually learn from patterns.
- It’s a first step into Natural Language Processing (NLP), the field behind chatbots, Google Translate, and Siri!
- You can now deploy this model in a small web app using Flask, so anyone can type a message and check if it’s spam.
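Before wiring up any web app, a useful first step is saving the trained model so another script can load it. The sketch below is one possible approach, not the only one: it bundles the vectorizer and classifier into a scikit-learn `Pipeline` and saves it with `joblib`. The toy messages and the file name `spam_pipeline.joblib` are made up for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data: 1 = spam, 0 = ham.
toy_msgs = [
    "win a free prize",
    "free entry win cash",
    "see you at lunch",
    "call me after lunch",
]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in one object, so raw text goes in.
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(toy_msgs, toy_labels)

# Save to a temporary folder (hypothetical file name), then load it back.
model_path = os.path.join(tempfile.mkdtemp(), "spam_pipeline.joblib")
joblib.dump(spam_pipeline, model_path)
restored = joblib.load(model_path)

restored_pred = int(restored.predict(["win free cash"])[0])
print("Prediction from the reloaded model:", restored_pred)  # 1 means spam
```

A web app would then only need to load that one file and call `predict` on whatever text the user types.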
Final Thoughts
If you made it this far, congrats! You didn’t just copy code — you built your first working AI model.
Keep experimenting, keep learning, and soon you’ll be deploying your own intelligent apps to the world. 🌍
