Toxicity in Tweets using a BERT model

The goal

The goal of this project is to create a model that can accurately classify a piece of text as toxic or not. Basically, toxicity = 1 or 0.

In principle, this is a simple problem to solve: all you need is a dataset of texts labeled as toxic or not, and then you can train your model on it.

[Diagram: the text "Hello World!" goes into a magic cloud and comes back "not toxic!", while the text "F* the World!" comes back "toxic!"]

The dataset

The competition specifies that the model must be able to classify texts written in Brazilian Portuguese, so the dataset is in Portuguese as well.

The dataset is based on ToLD-Br, a huge dataset of tweets (or is it Xeets now?) that includes additional labels indicating whether the text contains homophobia, obscenity, insults, racism, misogyny or xenophobia. The dataset for the competition, however, has a single toxicity column.

[Table: a small snapshot of the dataset]

On the left, the 'Text' column contains the tweet in question; the 'Toxicity' column indicates whether the text is toxic or not (1 or 0).

Classification problem

When you think about classification, your first guess is probably that you need some kind of neural network.

As you may guess from the title of the article, BERT was chosen: it is a more recent neural network architecture built on transformers, which makes it a great fit for Natural Language Processing (NLP).

How does BERT work?

Recurrent and convolutional neural networks use sequential computation to generate predictions. Once trained on huge datasets, they can predict which word will follow a given sequence of words. Because they only look at the words that came before, this behavior is nicknamed unidirectional.

BERT, however, has a mechanism called self-attention, which makes this prediction based not only on the words that precede a token but also on the words that follow it - in other words, a bi-directional algorithm.

Source: Javier Canales Luna @ DataCamp
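
If you want to see this bidirectional behavior in action, here's a small illustrative sketch (mine, not part of the competition code) using the Hugging Face fill-mask pipeline and the same Portuguese BERT model that shows up later in this post:

from transformers import pipeline

# Load a fill-mask pipeline with the pretrained Portuguese BERT model
fill_mask = pipeline('fill-mask', model='neuralmind/bert-base-portuguese-cased')

# BERT uses the words on BOTH sides of [MASK] to guess the missing token
for prediction in fill_mask('Tinha uma [MASK] no meio do caminho.'):
    print(prediction['token_str'], round(prediction['score'], 3))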

The training

First of all, the training data must be cleaned up so that fewer characters need to be processed by our model. There's some theory on which characters matter and which don't, but I settled on this final format_text function:

import re
from nltk.corpus import stopwords

def format_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove words that begin with @ such as tagging @user
    text = re.sub(r'@\w+', '', text)
    # Remove words that begin with # such as #happy
    text = re.sub(r'#\w+', '', text)
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove punctuation and emojis
    text = re.sub(r'[^\w\s]', '', text)
    # Remove stop words
    pt_stp_words = stopwords.words('portuguese')
    text = ' '.join([word for word in text.split() if word not in pt_stp_words])
    # Collapse repeated whitespace into single spaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text
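For example, running a made-up tweet through the function gives something like this (the exact output depends on the stopword list):

print(format_text('Olá @user, esse é um exemplo de tweet! #exemplo https://t.co/abc'))
# roughly: 'olá exemplo tweet'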

The comments are all self-explanatory. All but one: stopwords.

What are stopwords?

Stopwords are words that appear very frequently in text but don't add much meaning on their own.

Examples in English are "i", "my", "myself", "you" and "your". More words can be found here.

For this project, however, I've used stopwords for the Portuguese language available in the nltk.corpus package.
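
For reference, this is how the Portuguese stopword list can be loaded (the corpus needs to be downloaded once):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # only needed the first time

pt_stop_words = stopwords.words('portuguese')
print(len(pt_stop_words))   # a couple hundred very common words
print(pt_stop_words[:10])   # short words such as 'de', 'a', 'o', 'que', 'e', ...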

The model

Now, to feed our model, we'll create a TextClassificationDataset class that handles storing and encoding our texts.

import torch
from torch.utils.data import Dataset

class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # Tokenize the text, padding/truncating it to max_length
        encoding = self.tokenizer(text, return_tensors='pt', max_length=self.max_length, padding='max_length', truncation=True)
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label)
        }
  • We begin by defining that this class is a PyTorch Dataset.
  • The __init__ method takes the arguments texts and labels, which are lists of values taken from the train dataset. So, for example, row #3 would have the content of the tweet at texts[2] and its classification at labels[2].
  • The tokenizer argument is used to convert the texts into a format the model can understand, since it cannot work with raw text.
  • The max_length argument is used to limit the length of the tokenized sequences.
  • The __len__ method returns the number of samples.
  • The __getitem__ method retrieves a specific item given an index idx. It fetches the item from the lists of texts and labels, and encodes the text using the tokenizer from __init__.
  • This encoding is split into two parts: input_ids and attention_mask. input_ids is the tokenized text, and attention_mask is a binary mask that indicates which tokens are actual words versus padding.
  • Everything is transformed into PyTorch tensors.
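
To give an idea of how these pieces fit together: the tokenizer used throughout this post is never shown being created, but it is presumably the BERTimbau tokenizer. Here's a small sketch with two made-up rows:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')

# Two made-up rows, just to show the shapes involved
example_texts = ['olá mundo', 'exemplo de tweet limpo']
example_labels = [0, 1]

example_dataset = TextClassificationDataset(example_texts, example_labels, tokenizer, max_length=128)
print(len(example_dataset))                    # 2 samples
print(example_dataset[0]['input_ids'].shape)   # torch.Size([128]): padded/truncated to max_length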

With the data cleaned up, it was time to create the BERT classifier. For this project, I used BERTimbau Base, a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performance on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment.

[Image: the famous Muppet Bert holding a timbau, a famous Brazilian musical instrument]

These people are so creative.

In the end, this is what our BERTClassifier looked like:

import torch
import torch.nn as nn
from transformers import BertModel

class BERTClassifier(nn.Module):
    def __init__(self, bert_model_name, num_classes):
        super(BERTClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        x = self.dropout(pooled_output)
        logits = self.fc(x)
        return logits

# Example of initialization
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BERTClassifier('neuralmind/bert-base-portuguese-cased', 2).to(device)

# If there's a .pth file to load
model.load_state_dict(torch.load('bert_classifier.pth'))

Breaking this stuff into parts:

  • The __init__ function acts as a constructor. It loads the pretrained BertModel given by bert_model_name, adds a dropout layer to keep things in check, and adds a linear layer to classify text into num_classes - in our case, 2 polar opposites.
  • The forward function is defined so that the BERT output correctly passes through the additional layers we've set up.

Please note that I didn't tinker a lot with these, since they were mostly the defaults from the sources I was studying.
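
As a quick sanity check (this snippet is my own, not from the original notebook), you can push a single tokenized sentence through the classifier and confirm that it produces one logit per class:

# Tokenize one sentence with the tokenizer created earlier
encoding = tokenizer('um exemplo qualquer', return_tensors='pt', max_length=128, padding='max_length', truncation=True)

with torch.no_grad():
    logits = model(encoding['input_ids'].to(device), encoding['attention_mask'].to(device))

print(logits.shape)  # torch.Size([1, 2]): one row of scores, one per class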

Given all of that, now we need our train function. We'll need a lot of things, though:

from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Set up parameters
bert_model_name = 'neuralmind/bert-base-portuguese-cased'
num_classes = 2
max_length = 128
batch_size = 16
num_epochs = 2
learning_rate = 2e-5

def train(model, data_loader, optimizer, scheduler, device):
    model.train()
    for batch in data_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()

## Begin training

# Split into train and validation datasets (texts and labels come from the cleaned-up dataset)
train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)
train_dataset = TextClassificationDataset(train_texts, train_labels, tokenizer, max_length)
val_dataset = TextClassificationDataset(val_texts, val_labels, tokenizer, max_length)

# Create DataLoader for batch processing
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)

# Additional steps
optimizer = AdamW(model.parameters(), lr=learning_rate)
total_steps = len(train_dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

A lot to unpack here:

  • First, we define some parameters that are going to be used in the model.
  • num_classes is simple: either toxic, or not.
  • max_length, as already described, is the maximum length of the encoded text.
  • batch_size is the number of samples to work through before the model's internal parameters are updated. This value is a balance between reasonable memory requirements and not losing too much performance.
  • learning_rate is 2e-5, which is 0.00002. If the learning rate is too high, the model might overshoot the minimum of the loss function and fail to converge. If the rate is too low, the model might get stuck in a suboptimal solution. The value of 2e-5 is commonly used because it is small enough to allow the model to make gradual progress without overshooting or converging too slowly.

Let's skip the train method for now and explain the items below:

  • The optimizer is used to adjust the parameters of our model to minimize the error, or loss function. The optimizer changes the weights and biases of the neurons in response to the error the model produced in its predictions during training. AdamW is a variation of the Adam optimizer.
  • total_steps is the total number of training steps that will be run. Since each epoch goes through the entire dataset once, it is the number of epochs times the number of batches per epoch.
  • The learning rate scheduler, scheduler, adjusts the learning rate during training, which has been shown to help avoid overfitting, converge faster and escape saddle points (see the small illustration below).
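
To make the scheduler's effect concrete, here's a tiny standalone illustration (mine, not part of the competition code) of how get_linear_schedule_with_warmup decays the learning rate linearly to zero when there is no warmup:

import torch
from transformers import get_linear_schedule_with_warmup

# A throwaway optimizer with a single dummy parameter, just to watch the schedule
dummy = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.AdamW([dummy], lr=2e-5)
sched = get_linear_schedule_with_warmup(opt, num_warmup_steps=0, num_training_steps=10)

for step in range(10):
    opt.step()
    sched.step()
    print(step, sched.get_last_lr())  # the learning rate shrinks toward 0 at each step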

Given everything that was said (and I know that it's too much!), now let's break down the train method:

  • First, it sets the model to training mode.
  • Then, enters a loop for each batch of the data loader.
  • In this loop, it clears the gradients, since PyTorch accumulates them; they need to be reset for each batch.
  • It moves the batch to the device being used for training, such as the CPU or GPU.
  • Then, it retrieves the input IDs, attention masks and labels; the first two are used as input to the model.
  • Then, with whatever the model outputs, the loss is calculated with the CrossEntropyLoss function.
  • It performs backpropagation by calling loss.backward().
  • The optimizer.step() applies the gradients computed in the previous step to update the model's parameters.
  • Finally, the learning rate is adjusted with scheduler.step().

Phew! A lot of things to uncover.

In the end, we can just call the train function for each epoch, and then save the model as a .pth file.

for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    train(model, train_dataloader, optimizer, scheduler, device)
    accuracy, report = evaluate(model, val_dataloader, device)
    print(f"Validation Accuracy: {accuracy:.4f}")
    print(report)

torch.save(model.state_dict(), "bert_classifier.pth")
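Note that the loop above calls an evaluate helper that isn't shown in this post. Here's a minimal sketch of what it could look like, assuming scikit-learn's accuracy_score and classification_report (my guess, based on the printed report):

from sklearn.metrics import accuracy_score, classification_report

def evaluate(model, data_loader, device):
    model.eval()
    predictions, actual_labels = [], []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            _, preds = torch.max(outputs, dim=1)
            predictions.extend(preds.cpu().tolist())
            actual_labels.extend(labels.cpu().tolist())
    return accuracy_score(actual_labels, predictions), classification_report(actual_labels, predictions)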

The trained model will be saved at that path, and can be loaded and used to predict the toxicity of texts! Here's one example:

def predict_sentiment(text, model, tokenizer, device, max_length=128):
    model.eval()
    encoding = tokenizer(text, return_tensors='pt', max_length=max_length, padding='max_length', truncation=True)
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        _, preds = torch.max(outputs, dim=1)
    return 1 if preds.item() == 1 else 0

# Load the model from the .pth file
model = BERTClassifier('neuralmind/bert-base-portuguese-cased', 2)
model.load_state_dict(torch.load('bert_classifier.pth'))
model.to(device)
print(predict_sentiment('Hello world!', model, tokenizer, device)) # Returns 0

Conclusion

And that's it! If you want to check it out and train/test this model yourself, feel free to check the code in my GitHub repository!

This post was born out of my first Kaggle competition!

[Image: the final leaderboard for the competition, with the rest of the candidates censored]

Despite not winning the competition, I finished very close to the top, with only 0.00952 separating me from first place, so I hope my experience can also teach other beginners something useful!

I'm already a software engineer at work, but artificial intelligence has always been a source of curiosity for me. When I was in college, I had a brief exposure to computer vision and even ended up publishing some scientific articles. Now, I'm trying to make up for lost time by studying and learning AI again. Follow me to join me on my journey!

Special thanks

First of all, special thanks to Pedro Gengo and the folks over at Tensorflow User Group São Paulo for creating the Kaggle Competition and inspiring this project!

Also, huge thanks to Kang Pham for writing this tutorial where I got most of this code!

And finally, thanks to Pedro Henrique Vieira de Lima, whose work on Detecção de Comentários Tóxicos em Chats e Redes Sociais com Deep Learning (Detection of Toxic Comments in Chats and Social Networks with Deep Learning) was crucial for hitting a higher score on the leaderboard.
