Text classification isn't just a technical concept - it's your secret weapon for making sense of messy, unstructured text data! Imagine instantly categorizing thousands of customer comments, product reviews, or support tickets into meaningful emotions like 'joy', 'sadness', or 'anger' - all without reading a single word manually. This isn't just convenient; it's a game-changer for understanding customer sentiment at scale.
In this hands-on tutorial, we'll dive deep into the fascinating world of emotion detection using two powerful approaches:
The DIY Approach: We'll roll up our sleeves and fine-tune a lightning-fast version of BERT (DistilBERT) ourselves. You'll feel that rush of accomplishment when your custom model delivers 92.5% accuracy! 💪
The API Approach: For those who prefer speed and simplicity, we'll explore Cohere's Classification API that lets you build production-ready classifiers with minimal technical overhead.
The Magic of Text Classification
Think of text classification as your personal data sorting wizard. Feed it thousands of comments, reviews, or messages, and watch as it automatically organizes them into categories that matter to your business. The beauty lies in its adaptability: train it once on your labeled examples, and it generalizes to text it has never seen before.
When it comes to implementation, you have two exciting paths forward:
Fine-tuning pre-trained powerhouses like DistilBERT: This "knowledge-distilled" version of BERT gives you 97% of the performance at a fraction of the computational cost. While this approach offers incredible flexibility, it does require understanding PyTorch (.pt) files and some computing muscle. My experiments with 16,000 training examples showed dramatic training time differences between local CPUs and specialized hardware like Google Colab's TPUs!
Leveraging specialized classification APIs like Cohere's Classification Tuner: This approach gives you enterprise-grade results without the technical complexity. Simply upload your labeled data, and Cohere handles the heavy lifting, providing you with a simple API for making predictions.
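Curious what that path looks like in practice? Here's a minimal sketch using Cohere's Python SDK (we'll go deeper in Part 2). Treat it purely as an illustration: the API key, example texts, and exact class names are placeholders, and attribute names can vary slightly between SDK versions.

import cohere

# Illustrative sketch only - assumes Cohere's Python SDK classify endpoint
co = cohere.Client("YOUR_API_KEY")  # placeholder key

# Cohere's classifier expects at least a couple of labeled examples per class
examples = [
    cohere.ClassifyExample(text="I can't stop smiling today", label="joy"),
    cohere.ClassifyExample(text="What a wonderful surprise!", label="joy"),
    cohere.ClassifyExample(text="Everything feels hopeless", label="sadness"),
    cohere.ClassifyExample(text="I miss them so much", label="sadness"),
]

response = co.classify(
    inputs=["The support team resolved my issue in minutes!"],
    examples=examples,
)
print(response.classifications[0].prediction)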
In this tutorial, we'll build both solutions and create a sleek Streamlit interface to showcase your emotion detector in action. Whether you're a hands-on ML engineer or looking for the quickest path to production, you'll walk away with practical knowledge you can apply immediately. We're using an emotions dataset generously shared on Kaggle.
Our journey breaks down into three exciting parts:
- Part 1: The DIY approach with DistilBERT
- Part 2: The streamlined Cohere approach (coming soon!)
- Part 3: Building a stunning Streamlit UI to showcase your model (coming soon!)
Part 1: DIY - Fine-tuning DistilBERT for Emotion Detection
First things first: you'll need a Google Account to access Colab, a Hugging Face account for model hosting, and some Python libraries. But before we start coding, let's understand what makes DistilBERT so special.
Understanding DistilBERT: Small Package, Big Performance
Developed by the brilliant team at Hugging Face, DistilBERT is like BERT's more efficient younger sibling. Through a process called knowledge distillation, it captures the essence of what makes BERT powerful while being significantly smaller and faster. The result? A model that achieves about 97% of BERT's performance while being 40% smaller and 60% faster. This is perfect for when you need production-ready performance without breaking the bank on computing resources!
Let's Get Our Hands Dirty: Fine-tuning DistilBERT
Fire up your Google Colab notebook, switch your runtime to use a GPU accelerator (trust me, you'll want this!), and let's transform DistilBERT into an emotion-detection powerhouse:
Step 1: Set Up Your Environment
# Environment Setup - Watch how quickly these install with Colab's lightning-fast internet!
! pip install transformers datasets accelerate pandas numpy evaluate
# Import Libraries
import os
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from transformers import EarlyStoppingCallback
from datasets import Dataset
import evaluate
# Access your data from Google Drive
from google.colab import drive
drive.mount('/content/drive')
Step 2: Load and Prepare Your Emotion Data
Here's where the magic begins—we'll transform raw text files into structured data that our model can learn from:
def parse_emotion_file(file_path):
    texts, emotions = [], []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            text, emotion = line.strip().split(';')
            texts.append(text)
            emotions.append(emotion)
    return pd.DataFrame({'text': texts, 'emotion': emotions})
# Define your data directories
PARENT_DIR = "/content/drive/MyDrive/MLEng/"
MODEL_DIR = PARENT_DIR + "model_outputs/"
# Load your datasets - these will be the foundation of your emotion detector!
train_df = parse_emotion_file(PARENT_DIR + "data/train.txt")
val_df = parse_emotion_file(PARENT_DIR + "data/val.txt")
test_df = parse_emotion_file(PARENT_DIR + "data/test.txt")
# Get a glimpse of what you're working with
print(f"Training examples: {len(train_df)}")
print(f"Validation examples: {len(val_df)}")
print(f"Test examples: {len(test_df)}")
print(f"Emotions in our dataset: {sorted(train_df['emotion'].unique())}")
Step 3: Tokenization - Teaching Your Model to Understand Text
Tokenization transforms human language into something models can process. Think of it as creating a specialized vocabulary for your AI:
# Load the tokenizer - this will convert text to tokens our model understands
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Create datasets from your dataframes
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)
# This function transforms your text into sequences of token IDs
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)
# Apply tokenization to all your datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)
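It's worth peeking at what the tokenizer actually produces. Here's a quick sanity check (the sample sentence is just an illustration) showing the token IDs and attention mask that DistilBERT will consume:

# Quick sanity check: see how a sentence becomes token IDs
sample = tokenizer("i feel wonderful today", padding="max_length", truncation=True, max_length=128)
print(sample["input_ids"][:10])       # first few token IDs (101 is the special [CLS] token)
print(sample["attention_mask"][:10])  # 1 = real token, 0 = padding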
Step 4: Prepare Labels for Training
Our model needs to understand which emotions correspond to which numerical values:
# Create a mapping between emotion labels and IDs
labels = sorted(train_df["emotion"].unique())
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}
print("Our emotion mappings:")
for label, idx in label2id.items():
    print(f"{label}: {idx}")
# Add numerical labels to your datasets
def add_labels(examples):
    examples["labels"] = [label2id[emotion] for emotion in examples["emotion"]]
    return examples
train_dataset = train_dataset.map(add_labels, batched=True)
val_dataset = val_dataset.map(add_labels, batched=True)
test_dataset = test_dataset.map(add_labels, batched=True)
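Before moving on, a quick check confirms the new numeric labels line up with the original emotion strings:

# Verify that the numeric label maps back to the original emotion string
sample_row = train_dataset[0]
print(sample_row["emotion"], "->", sample_row["labels"], "->", id2label[sample_row["labels"]])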
Step 5: Load and Configure DistilBERT for Emotion Classification
Now we'll harness the power of transfer learning by starting with pre-trained DistilBERT:
# Set up an evaluation metric - accuracy is a natural choice for this multi-class task
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    # accuracy.compute() already returns a dict like {"accuracy": 0.92}
    return accuracy.compute(predictions=predictions, references=labels)
# Load the pre-trained model and customize it for our emotion classification task
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(labels),
    label2id=label2id,
    id2label=id2label
)
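When this cell runs, Hugging Face will warn that some weights (the new classification head) were newly initialized - that's expected, since the head is exactly what we're about to fine-tune. You can confirm the model picked up our label mapping:

# The classification head is freshly initialized; fine-tuning will train it
print(model.config.num_labels)  # should equal len(labels)
print(model.config.id2label)    # should list our emotion names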
Step 6: Train Your Emotion Detector!
This is where all your preparation pays off - we'll train DistilBERT to detect emotions with incredible accuracy:
# Configure training parameters - a solid starting point for fine-tuning
training_args = TrainingArguments(
    output_dir=os.path.join(MODEL_DIR, "checkpoints"),
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,               # Small learning rate for fine-tuning
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    weight_decay=0.01,                # Prevents overfitting
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,                        # Mixed precision training for speed
    gradient_accumulation_steps=2     # Effective train batch size: 32 x 2 = 64
)
# Initialize the Trainer with everything it needs
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]  # Stop if not improving
)
# The moment of truth - train your model!
print("🚀 Training started! This might take a while, but the results will be worth it...")
trainer.train()
Step 7: Evaluate Your Model's Performance
Let's see how well your emotion detector performs on unseen data:
# Evaluate on the test set
test_results = trainer.evaluate(test_dataset)
print(f"đź“Š Test results: {test_results}")
# A glimpse at what your model achieved
print(f"🎯 Accuracy: {test_results['eval_accuracy']:.2%}")
When I ran this, the results were spectacular:
Test results: {'eval_loss': 0.15945003926753998, 'eval_accuracy': 0.925, 'eval_runtime': 1.6421, 'eval_samples_per_second': 1217.947, 'eval_steps_per_second': 19.487, 'epoch': 5.0}
Model saved to /content/drive/MyDrive/MLEng/model_outputs/final_model
That's 92.5% accuracy! Your model has learned to recognize emotions from text with near-human level performance!
Step 8: Save Your Emotion-Detecting Masterpiece
Let's save your model so you can use it anytime, anywhere:
# Save the fine-tuned model and tokenizer
final_model_path = os.path.join(MODEL_DIR, "final_model")
trainer.save_model(final_model_path)
tokenizer.save_pretrained(final_model_path)
print(f"âś… Model successfully saved to {final_model_path}")
Step 9: Take Your Model for a Spin!
The moment you've been waiting for - see your emotion detector in action:
from transformers import pipeline
# Create an emotion classification pipeline
emotion_classifier = pipeline("text-classification", model=final_model_path, tokenizer=final_model_path)
# Try it on some examples
test_texts = [
    "I'm feeling very happy today!",
    "This news is absolutely devastating.",
    "I'm so angry I could scream!",
    "The test results have me feeling very anxious."
]
# Get predictions
for text in test_texts:
    result = emotion_classifier(text)
    print(f"Text: '{text}'")
    print(f"Predicted emotion: {result[0]['label']}, Confidence: {result[0]['score']:.2%}\n")
Congratulations! You've successfully:
- Built a powerful emotion detection model using state-of-the-art NLP techniques
- Achieved impressive 92.5% accuracy on unseen data
- Created a model that can instantly categorize text by emotion
- Mastered the fine-tuning process for transformer models
In the next parts of this tutorial, we'll explore how to:
- Build the same classifier using Cohere's easy-to-use API (Part 2)
- Create an interactive Streamlit app to showcase your emotion detector (Part 3)
Stay tuned for these exciting follow-ups, and in the meantime, try experimenting with your own text examples to see how accurately your model can detect emotions!
Resources for Further Exploration
Special thanks to Praveen for the original dataset.
Have you tried building your own emotion detector? What accuracy did you achieve? Share your results in the comments below!