Introduction
Sequence modeling is a powerful technique for understanding and predicting patterns in ordered data. From predicting the next word in a sentence to forecasting stock prices, sequence models are everywhere. In this post, we'll dive deep into sequence modeling by building a sentiment analysis model using a bidirectional Long Short-Term Memory (LSTM) network in PyTorch.
We'll be working with a synthetic movie review dataset, which will allow us to focus on the model-building process without getting bogged down in complex data cleaning. By the end of this tutorial, you'll have a solid understanding of how to build, train, and evaluate your own LSTM-based sentiment analysis model.
What We'll Cover
- The Basics of Sequence Modeling: A quick refresher on why sequence data is special.
- Setting Up the Project: We'll define our dataset and model classes in PyTorch.
- Generating a Synthetic Dataset: We'll create a realistic movie review dataset to train our model.
- Building the Vocabulary: How to map our words to numbers that our model can understand.
- The LSTM Model: A detailed look at the architecture of our bidirectional LSTM.
- Training and Evaluation: We'll train our model and evaluate its performance.
- Making Predictions: We'll test our model on new, unseen movie reviews.
1. Setting Up the Project
First, let's import the necessary libraries and set up our device. We'll prioritize using a GPU (either CUDA or Apple's MPS) if available.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import random
# Set up device
device = torch.device('mps' if torch.backends.mps.is_available() else 'cuda' if torch.cuda.is_available() else 'cpu')
2. The Dataset
To handle our text data, we'll create a custom TextDataset class in PyTorch. This class will take care of tokenizing our text, converting tokens to numerical IDs, and padding or truncating sequences to a fixed length. This is a crucial step in preparing our data for the model.
class TextDataset(Dataset):
    def __init__(self, texts, labels, vocab_to_idx, max_length=50):
        self.texts = texts
        self.labels = labels
        self.vocab_to_idx = vocab_to_idx
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # Tokenize and map each token to its ID, falling back to <UNK> for unknown words
        tokens = text.lower().split()
        token_ids = [self.vocab_to_idx.get(token, self.vocab_to_idx['<UNK>']) for token in tokens]
        # Pad with <PAD> or truncate so every sequence has exactly max_length tokens
        if len(token_ids) < self.max_length:
            token_ids.extend([self.vocab_to_idx['<PAD>']] * (self.max_length - len(token_ids)))
        else:
            token_ids = token_ids[:self.max_length]
        return torch.tensor(token_ids, dtype=torch.long), torch.tensor(label, dtype=torch.long)
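To make the padding behaviour concrete, here's a tiny sanity check you could run once the class is defined. The toy vocabulary below is made up purely for illustration and isn't part of the notebook:

# Hypothetical toy vocabulary, just to illustrate what __getitem__ returns
toy_vocab = {'<PAD>': 0, '<UNK>': 1, 'great': 2, 'movie': 3}
toy_dataset = TextDataset(['Great movie'], [1], toy_vocab, max_length=5)
token_ids, label = toy_dataset[0]
print(token_ids)  # tensor([2, 3, 0, 0, 0]) -- padded with <PAD> up to max_length
print(label)      # tensor(1)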
3. The LSTM Model
Now, let's define our LSTMClassifier. This model uses a bidirectional LSTM, which allows it to process sequences in both forward and backward directions, capturing context from the entire sentence.
Here's a breakdown of the architecture:
- Embedding Layer: Converts word indices into dense vector representations.
- Bidirectional LSTM: Processes the embedded sequences to learn temporal dependencies.
- Dropout: A regularization technique to prevent overfitting.
- Fully Connected Layer: A linear layer that maps the LSTM's output to our final sentiment predictions (positive or negative).
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, num_classes, dropout=0.3):
        super(LSTMClassifier, self).__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers,
                            batch_first=True, dropout=dropout, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        # The LSTM is bidirectional, so its output is 2 * hidden_dim wide
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)
        # Initial hidden and cell states: (num_layers * 2 directions, batch, hidden_dim)
        h0 = torch.zeros(self.num_layers * 2, x.size(0), self.hidden_dim).to(x.device)
        c0 = torch.zeros(self.num_layers * 2, x.size(0), self.hidden_dim).to(x.device)
        lstm_out, _ = self.lstm(embedded, (h0, c0))
        # Use the output at the last time step for classification
        output = self.dropout(lstm_out[:, -1, :])
        output = self.fc(output)
        return output
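Before wiring in real data, a quick shape check with random token IDs confirms that the model produces one logit per class. The sizes below are arbitrary and chosen just for illustration:

# Dummy shape check -- arbitrary sizes, not the hyperparameters used later
dummy_model = LSTMClassifier(vocab_size=100, embedding_dim=16, hidden_dim=8,
                             num_layers=2, num_classes=2)
dummy_batch = torch.randint(0, 100, (4, 50))  # batch of 4 sequences of length 50
print(dummy_model(dummy_batch).shape)  # torch.Size([4, 2])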
4. Preparing the Data
With our classes defined, we can now generate our dataset, build our vocabulary, and create our data loaders. We'll use a template-based approach to create a synthetic dataset of movie reviews with clear sentiment.
# These helper functions are defined in the notebook. For brevity, we'll just call them here.
texts, labels = create_realistic_movie_dataset(num_samples=2000)
vocab_to_idx = build_vocabulary(texts, min_freq=2)
# Split the data
X_temp, X_test, y_temp, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42, stratify=labels)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)
# Create datasets and dataloaders
train_dataset = TextDataset(X_train, y_train, vocab_to_idx)
val_dataset = TextDataset(X_val, y_val, vocab_to_idx)
test_dataset = TextDataset(X_test, y_test, vocab_to_idx)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
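Both create_realistic_movie_dataset and build_vocabulary live in the accompanying notebook. If you don't have it handy, a minimal sketch along these lines will get you running; the templates and word lists here are illustrative stand-ins, not the notebook's exact code:

def create_realistic_movie_dataset(num_samples=2000):
    # Illustrative template-based generator: label 1 = positive, 0 = negative
    positive_words = ['fantastic', 'amazing', 'excellent', 'wonderful', 'brilliant']
    negative_words = ['terrible', 'boring', 'awful', 'horrible', 'disappointing']
    templates = ['this movie was {} and {}', 'a {} film with {} acting throughout']
    texts, labels = [], []
    for _ in range(num_samples):
        label = random.randint(0, 1)
        words = positive_words if label == 1 else negative_words
        texts.append(random.choice(templates).format(random.choice(words), random.choice(words)))
        labels.append(label)
    return texts, labels

def build_vocabulary(texts, min_freq=2):
    # Keep tokens that appear at least min_freq times; index 0 is reserved for <PAD>
    counter = Counter(token for text in texts for token in text.lower().split())
    vocab_to_idx = {'<PAD>': 0, '<UNK>': 1}
    for token, freq in counter.items():
        if freq >= min_freq:
            vocab_to_idx[token] = len(vocab_to_idx)
    return vocab_to_idx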
5. Training the Model
Now for the exciting part! We'll instantiate our model and train it using our train_loader and val_loader. Our training loop includes a loss function (cross-entropy), an optimizer (Adam), a learning rate scheduler, and early stopping to prevent overfitting.
# Model Hyperparameters
vocab_size = len(vocab_to_idx)
embedding_dim = 128
hidden_dim = 64
num_layers = 2
num_classes = 2
# Initialize and train the model
model = LSTMClassifier(vocab_size, embedding_dim, hidden_dim, num_layers, num_classes, dropout=0.2)
model.to(device)
# The train_model function is in the notebook and handles the training loop.
train_losses, val_losses, val_accuracies = train_model(model, train_loader, val_loader, num_epochs=20)
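The train_model function is defined in the notebook, but if you want to reproduce it from scratch, a condensed version of the loop described above could look like this. The learning rate, scheduler settings, and early-stopping patience are assumptions on my part:

def train_model(model, train_loader, val_loader, num_epochs=20, patience=3):
    # Illustrative training loop: cross-entropy loss, Adam, LR scheduler, early stopping
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)
    train_losses, val_losses, val_accuracies = [], [], []
    best_val_loss, bad_epochs = float('inf'), 0
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        train_losses.append(running_loss / len(train_loader))

        # Validation pass
        model.eval()
        val_loss, correct, total = 0.0, 0, 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = model(inputs)
                val_loss += criterion(outputs, targets).item()
                correct += (outputs.argmax(dim=1) == targets).sum().item()
                total += targets.size(0)
        val_losses.append(val_loss / len(val_loader))
        val_accuracies.append(correct / total)
        scheduler.step(val_losses[-1])

        # Early stopping on validation loss
        if val_losses[-1] < best_val_loss:
            best_val_loss, bad_epochs = val_losses[-1], 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return train_losses, val_losses, val_accuracies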
6. Evaluating the Model
After training, it's essential to evaluate our model's performance on the test set. This gives us an unbiased estimate of how well our model will perform on new, unseen data.
# The test_model function is in the notebook and returns performance metrics.
test_accuracy, predictions, targets, probabilities = test_model(model, test_loader)
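If you're not working from the notebook, a rough equivalent of test_model looks like this; the class names passed to the report are an assumption about the label order (0 = negative, 1 = positive):

def test_model(model, test_loader):
    # Illustrative evaluation: collect predictions and probabilities on the test set
    model.eval()
    predictions, targets, probabilities = [], [], []
    with torch.no_grad():
        for inputs, labels in test_loader:
            probs = torch.softmax(model(inputs.to(device)), dim=1)
            predictions.extend(probs.argmax(dim=1).cpu().tolist())
            probabilities.extend(probs.cpu().tolist())
            targets.extend(labels.tolist())
    accuracy = accuracy_score(targets, predictions)
    print(classification_report(targets, predictions, target_names=['negative', 'positive']))
    return accuracy, predictions, targets, probabilities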
The output of our test_model function gives us a classification report. In our run, the model does a great job of identifying positive reviews but struggles with negative ones. This is a common issue in sentiment analysis and could be addressed with a more balanced dataset or more advanced techniques.
7. Making Predictions
Let's see our model in action with some sample reviews:
sample_texts = [
"This movie is absolutely fantastic and amazing! I loved every minute of it.",
"Terrible boring film, complete waste of time. I hated everything about it.",
"Excellent story with wonderful acting and brilliant performance throughout.",
"Awful movie with horrible dialogue. Disappointed and would not recommend."
]
# The demonstrate_predictions function is in the notebook.
demonstrate_predictions(model, vocab_to_idx, sample_texts)
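demonstrate_predictions simply applies the same preprocessing as TextDataset and prints the predicted sentiment for each review. If you want to write it yourself, something like this works (assuming, as above, that class 1 is the positive label):

def demonstrate_predictions(model, vocab_to_idx, texts, max_length=50):
    # Illustrative inference helper: tokenize, pad, and classify each review
    model.eval()
    for text in texts:
        tokens = text.lower().split()
        token_ids = [vocab_to_idx.get(t, vocab_to_idx['<UNK>']) for t in tokens][:max_length]
        token_ids += [vocab_to_idx['<PAD>']] * (max_length - len(token_ids))
        inputs = torch.tensor([token_ids], dtype=torch.long).to(device)
        with torch.no_grad():
            probs = torch.softmax(model(inputs), dim=1)[0]
        sentiment = 'positive' if probs.argmax().item() == 1 else 'negative'
        print(f'{sentiment} ({probs.max().item():.2f}): {text}')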
Conclusion
And there you have it! We've successfully built, trained, and evaluated a bidirectional LSTM for sentiment analysis. While our model's performance isn't perfect, it provides a solid foundation that you can build upon.
You can find the full code for this project on GitHub and Hugging Face.
Happy coding!