Deviprasad Shetty

Clickbait Detection with Machine Learning: A Complete Python Tutorial

Hey devs! 👋 Ever wondered how to build a real-world NLP classifier? Today, we're diving into clickbait detection using scikit-learn, TF-IDF, and Random Forest. I'll walk you through the entire process, from data prep to deployment on Hugging Face.

Why Clickbait Detection Matters

In the age of social media, clickbait wastes readers' time and can help spread misinformation. As developers, we can build tools to combat this. My model reaches 91.45% accuracy on the held-out test split of a 32,000-headline dataset.

Dataset & Setup

We're using the Clickbait Dataset from Kaggle. Balanced classes: 16K clickbait, 16K real news.

pip install pandas scikit-learn matplotlib seaborn joblib huggingface_hub

Step 1: Data Loading & Preprocessing

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv("clickbait_data.csv")
df.dropna(inplace=True)

# Map labels
df.rename(columns={'clickbait': 'label'}, inplace=True)
df['label'] = df['label'].map({0: 'real', 1: 'clickbait'})

print(f"Dataset shape: {df.shape}")
print(df.head())
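
Before splitting, it's worth confirming the class balance and checking for duplicate headlines. Here's a minimal sketch on a tiny stand-in frame; the rows are made up for illustration, and I'm assuming the same `headline` and `label` columns as the frame above:

```python
import pandas as pd

# Toy stand-in for clickbait_data.csv (hypothetical rows, same column
# layout as the renamed frame: 'headline' and 'label')
toy_df = pd.DataFrame({
    "headline": [
        "You won't believe these 7 tricks",
        "Senate passes budget bill",
        "This one weird tip changes everything",
        "Central bank holds interest rates steady",
    ],
    "label": ["clickbait", "real", "clickbait", "real"],
})

# Absolute and relative class counts
counts = toy_df["label"].value_counts()
shares = toy_df["label"].value_counts(normalize=True)
print(counts)
print(shares.round(2))

# Exact duplicate headlines could end up in both train and test sets
n_duplicates = int(toy_df["headline"].duplicated().sum())
print(f"Duplicate headlines: {n_duplicates}")
```

On the real dataset you'd run the same three checks on `df` instead of `toy_df`.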

Step 2: Train-Test Split

Stratified split to maintain class balance:

X_train, X_test, y_train, y_test = train_test_split(
    df['headline'], df['label'], 
    test_size=0.2, 
    random_state=42, 
    stratify=df['label']
)

print(f"Train: {len(X_train)}, Test: {len(X_test)}")
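
To see what `stratify` buys you, here's a small sketch on deliberately imbalanced toy labels (the 80/20 ratio and the placeholder headlines are assumptions for the demo, not the real dataset). The test split keeps the same class ratio as the full data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy labels: 80 'real', 20 'clickbait'
labels = pd.Series(["real"] * 80 + ["clickbait"] * 20)
texts = pd.Series([f"headline {i}" for i in range(100)])

# Same call shape as the split above, with stratification on the labels
_, _, y_train_strat, y_test_strat = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Class shares in the 20-row test split mirror the 80/20 source ratio
test_share = y_test_strat.value_counts(normalize=True)
print(test_share)
```

With balanced data like ours the effect is smaller, but stratifying still guarantees the split doesn't drift by chance.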

Step 3: Feature Extraction with TF-IDF

Convert text to numerical features:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

print(f"Feature matrix shape: {X_train_vec.shape}")

Step 4: Model Training

Random Forest for robust classification:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
model.fit(X_train_vec, y_train)

print("Model trained! ✅")
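
One optional variation, not what this tutorial does: wrap the vectorizer and the forest in a scikit-learn `Pipeline`, so raw headlines go in and labels come out of a single object. Here's a sketch on made-up training data, with a smaller `n_estimators` to keep the demo fast:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up training data, for illustration only
toy_X = [
    "You won't believe what happened next",
    "10 secrets doctors don't want you to know",
    "This simple trick will change your life",
    "Government announces new infrastructure budget",
    "Stock markets close lower after rate decision",
    "Scientists publish results of climate study",
]
toy_y = ["clickbait", "clickbait", "clickbait", "real", "real", "real"]

# One object that vectorizes and classifies raw text
clf_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("forest", RandomForestClassifier(n_estimators=20, random_state=42)),
])
clf_pipeline.fit(toy_X, toy_y)

pred = clf_pipeline.predict(["7 tricks that will change your morning"])
print(pred)
```

The trade-off: a pipeline means one artifact to save and no chance of pairing a model with the wrong vectorizer, while separate objects (as used here) make each step easier to inspect.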

Step 5: Evaluation

Check performance on test set:

from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

y_pred = model.predict(X_test_vec)
print(f"Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")
print(classification_report(y_test, y_pred))

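# Confusion matrix (rows: true labels, columns: predicted labels)
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=model.classes_, yticklabels=model.classes_)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.show()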

Results:

  • Accuracy: 91.45%
  • Macro F1: 0.91
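
In case "macro F1" is new to you: it's the unweighted mean of the per-class F1 scores. Here's a tiny worked sketch on made-up labels (not the real test set) that you can check by hand:

```python
from sklearn.metrics import f1_score

# Made-up labels: each class has 2 correct hits, 1 miss, 1 false alarm
y_true = ["clickbait", "clickbait", "clickbait", "real", "real", "real"]
y_hat = ["clickbait", "clickbait", "real", "real", "real", "clickbait"]

# Per-class F1, in the order given by `labels`
per_class_f1 = f1_score(y_true, y_hat, average=None, labels=["clickbait", "real"])

# Macro F1 = plain average of the per-class scores
macro_f1 = f1_score(y_true, y_hat, average="macro")

print(per_class_f1.round(3))
print(round(macro_f1, 3))
```

Both classes have precision 2/3 and recall 2/3 here, so each F1 is 2/3 and the macro average is 2/3 as well.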

Testing on Real Headlines

test_headlines = [
    "You won't believe what this celebrity did!",
    "New study reveals surprising health benefits",
    "10 hacks to boost your productivity"
]

predictions = model.predict(vectorizer.transform(test_headlines))
for text, pred in zip(test_headlines, predictions):
    print(f"'{text}' → {pred}")
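
Hard labels hide how sure the model is. `RandomForestClassifier` also exposes `predict_proba`, which returns one probability per class. A sketch on a toy model (training rows are made up, and the numbers you get on the real model will differ):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training data, for illustration only
train_texts = [
    "You won't believe what happened next",
    "10 secrets doctors don't want you to know",
    "This simple trick will change your life",
    "Government announces new infrastructure budget",
    "Stock markets close lower after rate decision",
    "Scientists publish results of climate study",
]
train_labels = ["clickbait", "clickbait", "clickbait", "real", "real", "real"]

toy_vec = TfidfVectorizer()
toy_model = RandomForestClassifier(n_estimators=50, random_state=0)
toy_model.fit(toy_vec.fit_transform(train_texts), train_labels)

new_headlines = [
    "You won't believe this budget trick",
    "Markets react to new climate study",
]

# One row per headline, one column per class (ordered as toy_model.classes_)
probs = toy_model.predict_proba(toy_vec.transform(new_headlines))
for text, row in zip(new_headlines, probs):
    scores = {cls: round(float(p), 2) for cls, p in zip(toy_model.classes_, row)}
    print(f"'{text}' -> {scores}")
```

A probability near 0.5 is a good signal to treat the prediction with caution instead of trusting the label outright.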

Deploy to Hugging Face

Save and upload the model:

import joblib
from huggingface_hub import HfApi

# Save locally
joblib.dump(model, "clickbait_detector.pkl")
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")

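# Upload both artifacts (the repo must already exist on the Hub)
api = HfApi()
for filename in ["clickbait_detector.pkl", "tfidf_vectorizer.pkl"]:
    api.upload_file(
        path_or_fileobj=filename,
        path_in_repo=filename,
        repo_id="Devishetty100/clickbait-detector",
        token="your-hf-token"  # placeholder; don't commit a real token
    )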
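
A small safety suggestion: keep the token out of your source code and read it from an environment variable instead. Below is a minimal sketch; `resolve_hf_token` is a hypothetical helper name of my own, and the demo variable holds a fake value so the snippet runs without a real token:

```python
import os


def resolve_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in `env_var`, or raise if it is missing.

    Hypothetical helper for this tutorial, not part of huggingface_hub.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token


# Demo only: a fake value in a throwaway variable
os.environ["HF_TOKEN_DEMO"] = "hf_fake_token_for_demo"
demo_token = resolve_hf_token("HF_TOKEN_DEMO")
print(demo_token[:3] + "...")
```

In the upload call you would then pass `token=resolve_hf_token()` instead of a string literal.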

Usage in Production

from huggingface_hub import hf_hub_download
import joblib

# Load from HF
model_path = hf_hub_download(repo_id="Devishetty100/clickbait-detector", filename="clickbait_detector.pkl")
vectorizer_path = hf_hub_download(repo_id="Devishetty100/clickbait-detector", filename="tfidf_vectorizer.pkl")

model = joblib.load(model_path)
vectorizer = joblib.load(vectorizer_path)

# Predict
def detect_clickbait(headline):
    features = vectorizer.transform([headline])
    return model.predict(features)[0]

print(detect_clickbait("Shocking truth about coffee!"))
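
Before relying on downloaded pickles, a quick round-trip check can catch mismatched files. Here's a sketch that saves a toy model and vectorizer to a temporary folder, reloads them, and compares predictions (toy data and file names are made up for the demo). Keep in mind that scikit-learn pickles are not guaranteed to load across library versions, so it's worth noting the version you trained with:

```python
import tempfile
from pathlib import Path

import joblib
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training data, for illustration only
texts = [
    "You won't believe what happened next",
    "This simple trick will change your life",
    "Government announces new infrastructure budget",
    "Stock markets close lower after rate decision",
]
labels = ["clickbait", "clickbait", "real", "real"]

vec = TfidfVectorizer()
forest = RandomForestClassifier(n_estimators=20, random_state=42)
forest.fit(vec.fit_transform(texts), labels)

samples = ["Shocking truth about coffee!", "Parliament approves trade deal"]
before = forest.predict(vec.transform(samples))

# Save both artifacts, then load them back as a fresh process would
with tempfile.TemporaryDirectory() as tmp_dir:
    model_file = Path(tmp_dir) / "toy_model.pkl"
    vec_file = Path(tmp_dir) / "toy_vectorizer.pkl"
    joblib.dump(forest, model_file)
    joblib.dump(vec, vec_file)
    restored_model = joblib.load(model_file)
    restored_vec = joblib.load(vec_file)

after = restored_model.predict(restored_vec.transform(samples))
round_trip_ok = bool((before == after).all())
print(f"scikit-learn {sklearn.__version__}, round trip ok: {round_trip_ok}")
```

Also, only load pickle files from sources you trust, since unpickling can execute arbitrary code.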

Next Steps & Improvements

While this model performs well, here are some ideas for future enhancements (these are suggestions, not planned features):

  • Try BERT or other transformers for better accuracy
  • Add multilingual support
  • Build a web API with FastAPI
  • Integrate into browser extensions

Feel free to fork the notebook and experiment!

What do you think? Have you built similar classifiers? Share your projects in the comments!

🔗 Kaggle Notebook | HF Model | Demo Space
