Hey devs! Ever wondered how to build a real-world NLP classifier? Today, we're diving into clickbait detection using scikit-learn, TF-IDF, and Random Forest. I'll walk you through the entire process, from data prep to deployment on Hugging Face.
Why Clickbait Detection Matters
In the age of social media, clickbait wastes readers' time and can spread misinformation. As developers, we can build tools to combat this. My model reaches 91.45% accuracy on a held-out test split of a 32,000-headline dataset.
Dataset & Setup
We're using the Clickbait Dataset from Kaggle. Balanced classes: 16K clickbait, 16K real news.
pip install pandas scikit-learn matplotlib seaborn joblib huggingface_hub
Step 1: Data Loading & Preprocessing
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Load data
df = pd.read_csv("clickbait_data.csv")
df.dropna(inplace=True)
# Map labels
df.rename(columns={'clickbait': 'label'}, inplace=True)
df['label'] = df['label'].map({0: 'real', 1: 'clickbait'})
print(f"Dataset shape: {df.shape}")
print(df.head())
Step 2: Train-Test Split
Stratified split to maintain class balance:
X_train, X_test, y_train, y_test = train_test_split(
    df['headline'], df['label'],
    test_size=0.2,
    random_state=42,
    stratify=df['label']
)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")
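To confirm that stratification did its job, compare the label shares in each split. The snippet below is a minimal sketch on a made-up list of 20 labels, not the Kaggle data; `label_share` is a hypothetical helper name. On the real data you would pass `y_train` and `y_test` to the same helper.

```python
from collections import Counter

from sklearn.model_selection import train_test_split


def label_share(labels):
    """Return each label's fraction of the given collection."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy stand-in for the dataset: 10 headlines per class (illustrative only).
toy_headlines = [f"headline {i}" for i in range(20)]
toy_labels = ["clickbait"] * 10 + ["real"] * 10

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_headlines, toy_labels,
    test_size=0.2, random_state=42, stratify=toy_labels,
)

print("train:", label_share(toy_y_train))
print("test: ", label_share(toy_y_test))
```

Both splits should show the same class proportions as the full label list. Without `stratify`, a small test set can drift away from that balance by chance.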
Step 3: Feature Extraction with TF-IDF
Convert text to numerical features:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
print(f"Feature matrix shape: {X_train_vec.shape}")
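For intuition about what those numbers mean, here is a pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF, then L2 normalization of each row). It assumes naive whitespace tokenization and no stop-word removal, so the values will not match the real vectorizer exactly. `tfidf_rows` and the three toy headlines are made up for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, as in scikit-learn's default: ln((1 + n) / (1 + df)) + 1.
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    rows = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in term_counts.items()}
        # L2-normalize so every headline vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


toy_docs = [
    "shocking coffee secrets",
    "coffee prices rise",
    "coffee study published",
]
for row in tfidf_rows(toy_docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

Because "coffee" appears in every toy headline, it ends up with a lower weight than rarer words such as "shocking". That down-weighting of common terms is what makes TF-IDF useful here.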
Step 4: Model Training
Random Forest for robust classification:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
model.fit(X_train_vec, y_train)
print("Model trained!")
Step 5: Evaluation
Check performance on test set:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
y_pred = model.predict(X_test_vec)
print(f"Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")
print(classification_report(y_test, y_pred))
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix")
plt.show()
Results:
- Accuracy: 91.45%
- Macro F1: 0.91
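Macro F1 is the unweighted mean of the per-class F1 scores, so both classes count equally regardless of size. Here is a minimal sketch of that calculation on six made-up predictions, not the model's actual output. `macro_f1` is a hypothetical helper; in practice `classification_report` already prints this figure.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores with equal weight per class."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Made-up labels: one clickbait headline is misclassified as real.
toy_true = ["clickbait", "clickbait", "clickbait", "real", "real", "real"]
toy_pred = ["clickbait", "clickbait", "real", "real", "real", "real"]

print(f"Macro F1: {macro_f1(toy_true, toy_pred):.3f}")  # Macro F1: 0.829
```

In this toy case the clickbait class scores 0.800 and the real class about 0.857, and the macro figure is simply their mean.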
Testing on Real Headlines
test_headlines = [
    "You won't believe what this celebrity did!",
    "New study reveals surprising health benefits",
    "10 hacks to boost your productivity"
]
predictions = model.predict(vectorizer.transform(test_headlines))
for text, pred in zip(test_headlines, predictions):
    print(f"'{text}' → {pred}")
Deploy to Hugging Face
Save and upload the model:
import joblib
from huggingface_hub import HfApi
# Save locally
joblib.dump(model, "clickbait_detector.pkl")
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")
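# Upload both artifacts: the classifier and the fitted vectorizer
api = HfApi()
for filename in ["clickbait_detector.pkl", "tfidf_vectorizer.pkl"]:
    api.upload_file(
        path_or_fileobj=filename,
        path_in_repo=filename,
        repo_id="Devishetty100/clickbait-detector",
        token="your-hf-token",  # placeholder: substitute your own token
    )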
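One alternative worth considering: wrapping the vectorizer and the classifier in a single scikit-learn `Pipeline` leaves only one file to save, upload, and keep in sync. The sketch below trains on four made-up headlines purely so that it runs on its own. The file name `clickbait_pipeline.pkl` is an assumption and is not part of the published repo.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (illustrative only, far too small for real use).
toy_headlines = [
    "You won't believe these 10 tricks",
    "This one weird trick will shock you",
    "Central bank raises interest rates",
    "Parliament passes new budget bill",
]
toy_labels = ["clickbait", "clickbait", "real", "real"]

# The pipeline takes raw text in and produces labels out.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=10, random_state=42),
)
pipeline.fit(toy_headlines, toy_labels)

# Round-trip through a single file, as you would before uploading it.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clickbait_pipeline.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

print(restored.predict(["Shocking truth about coffee!"])[0])
```

With this layout, callers pass raw headlines straight to `predict` and cannot pair the model with a mismatched vectorizer by accident.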
Usage in Production
from huggingface_hub import hf_hub_download
import joblib
# Load from HF
model_path = hf_hub_download(repo_id="Devishetty100/clickbait-detector", filename="clickbait_detector.pkl")
vectorizer_path = hf_hub_download(repo_id="Devishetty100/clickbait-detector", filename="tfidf_vectorizer.pkl")
model = joblib.load(model_path)
vectorizer = joblib.load(vectorizer_path)
# Predict
def detect_clickbait(headline):
    features = vectorizer.transform([headline])
    return model.predict(features)[0]
print(detect_clickbait("Shocking truth about coffee!"))
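In production it often helps to return a confidence score alongside the label, so that callers can choose their own cutoff. The sketch below uses `predict_proba`, which `RandomForestClassifier` provides, on a tiny made-up training set so that it runs standalone. `detect_with_confidence` is a hypothetical helper; with the real artifacts you would pass in the `model` and `vectorizer` loaded from the Hub.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up training set (illustrative only).
toy_headlines = [
    "You won't believe these 10 tricks",
    "This one weird trick will shock you",
    "Central bank raises interest rates",
    "Parliament passes new budget bill",
]
toy_labels = ["clickbait", "clickbait", "real", "real"]

toy_vectorizer = TfidfVectorizer()
toy_model = RandomForestClassifier(n_estimators=10, random_state=42)
toy_model.fit(toy_vectorizer.fit_transform(toy_headlines), toy_labels)


def detect_with_confidence(headline, model, vectorizer):
    """Return (label, probability) for the most likely class."""
    features = vectorizer.transform([headline])
    probabilities = model.predict_proba(features)[0]
    best = probabilities.argmax()
    return model.classes_[best], float(probabilities[best])


label, confidence = detect_with_confidence(
    "This weird trick will shock you", toy_model, toy_vectorizer
)
print(f"{label} ({confidence:.2f})")
```

For a Random Forest the probability is the average of the per-tree class probabilities, so treat it as a rough confidence signal, not a calibrated estimate.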
Next Steps & Improvements
While this model performs well, here are some ideas for future enhancements (these are suggestions, not planned features):
- Try BERT or other transformers for better accuracy
- Add multilingual support
- Build a web API with FastAPI
- Integrate into browser extensions
Feel free to fork the notebook and experiment!
What do you think? Have you built similar classifiers? Share your projects in the comments!
Kaggle Notebook | HF Model | Demo Space