Clickbait Detection with Machine Learning: A Complete Python Tutorial

#machinelearning #python #datascience #tutorial

Hey devs! 👋 Ever wondered how to build a real-world NLP classifier? Today, we're diving into clickbait detection using scikit-learn, TF-IDF, and Random Forest. I'll walk you through the entire process, from data prep to deployment on Hugging Face.

Why Clickbait Detection Matters

In the age of social media, clickbait wastes time and spreads misinformation. As developers, we can build tools to combat this. My model reaches 91.45% accuracy on the held-out 20% test split of a 32,000-headline dataset.

Dataset & Setup

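We're using the Clickbait Dataset from Kaggle, loaded below from clickbait_data.csv with a headline text column and a 0/1 clickbait column. The classes are balanced: 16K clickbait, 16K real news.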

pip install pandas scikit-learn matplotlib seaborn joblib huggingface_hub

Step 1: Data Loading & Preprocessing

import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv("clickbait_data.csv")
df.dropna(inplace=True)

# Map labels
df.rename(columns={'clickbait': 'label'}, inplace=True)
df['label'] = df['label'].map({0: 'real', 1: 'clickbait'})

print(f"Dataset shape: {df.shape}")
print(df.head())
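
One thing worth checking after the label mapping: Series.map with a dict silently turns any value missing from the dict into NaN instead of raising an error. Here is a minimal sketch on a hypothetical toy frame (the rows are made up; only the column names follow the code above):

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the tutorial's CSV.
toy = pd.DataFrame({
    "headline": [
        "You won't believe this",
        "Parliament passes budget",
        "10 tricks to try",
    ],
    "clickbait": [1, 0, 1],
})
toy = toy.rename(columns={"clickbait": "label"})
toy["label"] = toy["label"].map({0: "real", 1: "clickbait"})

# Every row should have received a label.
unmapped_rows = int(toy["label"].isna().sum())
label_counts = toy["label"].value_counts().to_dict()
print(unmapped_rows, label_counts)

# A value outside the dict (here 2) becomes NaN rather than raising.
stray = pd.Series([0, 1, 2]).map({0: "real", 1: "clickbait"})
stray_missing = int(stray.isna().sum())
print(stray_missing)
```

If unmapped_rows is ever non-zero on the real data, the stratified split in the next step will fail or behave oddly, so it is cheaper to catch it here.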

Step 2: Train-Test Split

Stratified split to maintain class balance:

X_train, X_test, y_train, y_test = train_test_split(
    df['headline'], df['label'],
    test_size=0.2,
    random_state=42,
    stratify=df['label']
)

print(f"Train: {len(X_train)}, Test: {len(X_test)}")
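
To see what stratify actually buys you, here is a small sketch on a hypothetical 20-row toy frame (not the Kaggle data). It checks that both splits keep the same 50/50 class ratio as the full set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data standing in for the real dataset.
toy = pd.DataFrame({
    "headline": [f"headline {i}" for i in range(20)],
    "label": ["clickbait"] * 10 + ["real"] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["headline"], toy["label"],
    test_size=0.2,
    random_state=42,
    stratify=toy["label"],
)

# Class shares in each split; with stratification they match the full set.
train_share = y_tr.value_counts(normalize=True).to_dict()
test_share = y_te.value_counts(normalize=True).to_dict()
print(train_share, test_share)
```

Without stratify, a small or imbalanced dataset can end up with a test set whose class mix differs from training, which makes accuracy harder to interpret.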

Step 3: Feature Extraction with TF-IDF

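Convert headlines to numerical features. The vectorizer is fit on the training set only and then reused to transform the test set, so no test vocabulary leaks into training: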

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

print(f"Feature matrix shape: {X_train_vec.shape}")

Step 4: Model Training

Train a Random Forest with 200 trees on the TF-IDF features (n_jobs=-1 uses all available CPU cores):

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
model.fit(X_train_vec, y_train)

print("Model trained! ✅")

Step 5: Evaluation

Check performance on test set:

from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

y_pred = model.predict(X_test_vec)
print(f"Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")
print(classification_report(y_test, y_pred))

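# Confusion matrix (rows: actual labels, columns: predicted labels)
labels = model.classes_
cm = confusion_matrix(y_test, y_pred, labels=labels)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()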

Results:

Accuracy: 91.45%
Macro F1: 0.91
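
For anyone unsure what "Macro F1" means: it is the plain average of the per-class F1 scores, so each class counts equally. The sketch below computes it by hand on six made-up predictions (not the tutorial's results) and compares it with scikit-learn's f1_score. The helper name macro_f1_by_hand is hypothetical.

```python
from sklearn.metrics import f1_score

# Made-up labels and predictions, for illustration only.
y_true = ["clickbait", "clickbait", "clickbait", "clickbait", "real", "real"]
y_hat = ["clickbait", "clickbait", "clickbait", "real", "real", "clickbait"]


def macro_f1_by_hand(true, pred):
    """Average the per-class F1 scores, weighting every class equally."""
    scores = []
    for cls in sorted(set(true)):
        tp = sum(t == cls and p == cls for t, p in zip(true, pred))
        fp = sum(t != cls and p == cls for t, p in zip(true, pred))
        fn = sum(t == cls and p != cls for t, p in zip(true, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


by_hand = macro_f1_by_hand(y_true, y_hat)
by_sklearn = f1_score(y_true, y_hat, average="macro")
print(by_hand, by_sklearn)
```

In this toy case the clickbait class scores 0.75 and the real class 0.5, so the macro average is 0.625. On a balanced dataset like this tutorial's, macro F1 and accuracy tend to land close together.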

Testing on Real Headlines

test_headlines = [
    "You won't believe what this celebrity did!",
    "New study reveals surprising health benefits",
    "10 hacks to boost your productivity"
]

predictions = model.predict(vectorizer.transform(test_headlines))
for text, pred in zip(test_headlines, predictions):
    print(f"'{text}' → {pred}")
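
A bare label hides how sure the model is. Random Forest also offers predict_proba, which returns one probability per class in the order given by classes_. Here is a sketch with a hypothetical helper, predict_with_confidence, trained on a made-up toy corpus so it runs on its own. With the real model you would pass model and vectorizer from above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy training data, far too small for a real model.
toy_texts = [
    "you won't believe this shocking trick",
    "shocking secrets doctors hate",
    "10 shocking facts you need to see",
    "senate votes on budget proposal",
    "central bank raises interest rates",
    "city council approves budget plan",
]
toy_labels = ["clickbait"] * 3 + ["real"] * 3

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_model = RandomForestClassifier(n_estimators=50, random_state=42)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)


def predict_with_confidence(model, vectorizer, headlines):
    """Return (label, probability of that label) for each headline."""
    proba = model.predict_proba(vectorizer.transform(headlines))
    best = proba.argmax(axis=1)
    return [(str(model.classes_[i]), float(row[i])) for i, row in zip(best, proba)]


results = predict_with_confidence(
    toy_model,
    toy_vectorizer,
    ["shocking trick you need to see", "council approves budget"],
)
for label, confidence in results:
    print(f"{label} ({confidence:.2f})")
```

Headlines with a confidence near 0.5 are good candidates for manual review or for an "unsure" bucket in a UI.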

Deploy to Hugging Face

Save the model and the fitted vectorizer, then upload them:

import joblib
from huggingface_hub import HfApi

# Save locally
joblib.dump(model, "clickbait_detector.pkl")
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")

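# Upload both artifacts: prediction needs the fitted vectorizer as well as the model
repo_id = "Devishetty100/clickbait-detector"
api = HfApi(token="your-hf-token")  # placeholder; avoid committing a real token
api.create_repo(repo_id=repo_id, exist_ok=True)  # skip if the repo already exists
for filename in ["clickbait_detector.pkl", "tfidf_vectorizer.pkl"]:
    api.upload_file(
        path_or_fileobj=filename,
        path_in_repo=filename,
        repo_id=repo_id,
    )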
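
An alternative worth considering, not what this tutorial's repo does: wrap the vectorizer and the classifier in a scikit-learn Pipeline so there is a single file to save, upload, and load. The sketch below uses a made-up toy corpus and a hypothetical file name, clickbait_pipeline.pkl, and only checks that the object survives a joblib round trip.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, far too small for a real model.
toy_texts = [
    "you won't believe this shocking trick",
    "shocking secrets doctors hate",
    "10 shocking facts you need to see",
    "senate votes on budget proposal",
    "central bank raises interest rates",
    "city council approves budget plan",
]
toy_labels = ["clickbait"] * 3 + ["real"] * 3

# The pipeline takes raw strings: it vectorizes, then classifies.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
pipeline.fit(toy_texts, toy_labels)

# Round trip through joblib in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clickbait_pipeline.pkl")  # hypothetical name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

before = list(pipeline.predict(toy_texts))
after = list(restored.predict(toy_texts))
print(before == after)
```

With one artifact there is no way to pair a model with the wrong vectorizer, and callers can pass raw headlines straight to predict.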

Usage in Production

from huggingface_hub import hf_hub_download
import joblib

# Load from HF
model_path = hf_hub_download(repo_id="Devishetty100/clickbait-detector", filename="clickbait_detector.pkl")
vectorizer_path = hf_hub_download(repo_id="Devishetty100/clickbait-detector", filename="tfidf_vectorizer.pkl")

# joblib.load unpickles Python objects, so only load files from repos you trust,
# and use the same scikit-learn version that was used for training
model = joblib.load(model_path)
vectorizer = joblib.load(vectorizer_path)

# Predict
def detect_clickbait(headline):
    features = vectorizer.transform([headline])
    return model.predict(features)[0]

print(detect_clickbait("Shocking truth about coffee!"))