To Begin With
Let me share what I made in a CTU class. Generative AI has been growing rapidly, and tools such as ChatGPT and Copilot have enriched our digital lives. However, alongside the expectation that life will become more convenient, there is also the risk of misuse by criminals and other bad actors.
Fake news is one of the most concerning uses of AI. Fake news is a powerful tool of misinformation that can distort public perception, incite fear, and undermine trust in legitimate sources, posing a significant threat to democracy and societal harmony.
With this background, I decided to build a model that can detect whether an article is real or fake.
Dataset
To train the model, I used two different datasets, listed below.
https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset/data
https://www.kaggle.com/datasets/emineyetm/fake-news-detection-datasets
The reason for mixing two datasets is simple: to increase the amount of training data and to help prevent overfitting. (A sketch of how the two could be combined follows the counts below.)
In total:
- 117,053 articles
- 56,445 real news
- 60,608 fake news
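Here is a minimal sketch of how the two datasets could be loaded and merged with pandas. The file names and the label encoding (fake = 0, real = 1) are assumptions; adjust them to the actual CSVs downloaded from Kaggle.

import pandas as pd

# First dataset: ships as separate CSVs for fake and real articles (file names assumed)
fake_1 = pd.read_csv('Fake.csv')
true_1 = pd.read_csv('True.csv')
fake_1['label'] = 0   # fake = 0 (assumed encoding)
true_1['label'] = 1   # real = 1

# Second dataset: same Fake/True style CSVs (file names assumed)
fake_2 = pd.read_csv('fake2.csv')
true_2 = pd.read_csv('true2.csv')
fake_2['label'] = 0
true_2['label'] = 1

# Merge everything into one DataFrame
df = pd.concat([fake_1, true_1, fake_2, true_2], ignore_index=True)
print(len(df), df['label'].value_counts())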
Model Overview
This time, I made the model with BERT. BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art pre-trained language model developed by Google. BERT's bidirectional approach, pre-training on large datasets, and ability to be fine-tuned for specific tasks make it one of the most powerful and widely used models in NLP.
The fine-tuning flow attaches a neural-network classifier head on top of the pre-trained BERT model. Since this is a Real-or-Fake binary classification task, the classifier head outputs two classes: Real and Fake.
Model Definition
from torch import nn
from transformers import BertModel

class BertForSequenceClassification(nn.Module):
    def __init__(self, activation_function, withdropout, hidden_size=768, num_labels=2, dropout_rate=0.1):
        super().__init__()
        # Pre-trained BERT encoder
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # Activation applied to the [CLS] embedding before the classifier head
        if activation_function == 'relu':
            self.activation_function = nn.ReLU()
        elif activation_function == 'sigmoid':
            self.activation_function = nn.Sigmoid()
        # Dropout is only active when 'withDropout' is requested
        if withdropout == 'withDropout':
            self.dropout = nn.Dropout(dropout_rate)
        else:
            self.dropout = nn.Dropout(0)
        # Linear classifier head: hidden_size -> num_labels (Real / Fake)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # The [CLS] token embedding summarizes the whole article
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        x = self.dropout(self.activation_function(cls_embedding))
        logits = self.classifier(x)
        return logits
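A quick way to sanity-check this class is to tokenize one sentence and run it through the model. The example sentence below is just an illustration.

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification(activation_function='relu', withdropout='withDropout')
model.eval()

encoded = tokenizer("Scientists confirm water was found on the moon.",
                    padding='max_length', truncation=True, max_length=128,
                    return_tensors='pt')
with torch.no_grad():
    logits = model(encoded['input_ids'], encoded['attention_mask'])
print(logits.shape)  # torch.Size([1, 2]) -> one score each for Real and Fake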
Comparison Model
To see how the BERT model compares with other architectures, I created several other models as well. Here is the list of models I created (a sketch of one of the non-BERT baselines follows the list).
Model 1 : BERT Model, learning rate = 1e-5, Dropout=0.5
Model 2 : BERT Model, learning rate = 1e-5, Dropout=0
Model 3 : BERT Model, learning rate = 5e-5, Dropout=0.5
Model 4 : BERT Model, learning rate = 5e-5, Dropout=0
Model 5 : CNN+LSTM
Model 6 : CNN
Model 7 : LSTM
Model 8 : GRU
Model 9 : CNN + GRU
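The code for the non-BERT baselines is not shown in this post. As a rough idea, here is a minimal sketch of what the LSTM baseline (Model 7) might look like, assuming the same token IDs and the same single-logit binary output as the BERT models; the layer sizes are illustrative.

import torch
from torch import nn

class LSTMClassifier(nn.Module):
    """Simple LSTM baseline: embed token IDs, run an LSTM, classify the final hidden state."""
    def __init__(self, vocab_size=30522, embed_dim=128, hidden_dim=128, dropout=0.5):
        super().__init__()
        # vocab_size defaults to the bert-base-uncased vocabulary; adjust to the tokenizer used
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden_dim, 1)  # one logit, like the BERT models

    def forward(self, input_ids, attention_mask=None):
        # attention_mask is accepted only to keep the same interface; it is unused here
        embedded = self.embedding(input_ids)          # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)          # hidden: (1, batch, hidden_dim)
        return self.linear(self.dropout(hidden[-1]))  # (batch, 1)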
Data preparation
I eliminated irrelevant features such as the article title, subject, and date, leaving only the text and the label (real or fake) in the data file. It is also known that special characters and tokens such as emoji and URLs can hurt the accuracy of a language model, so I removed all of those from the dataset as well. A rough sketch of this cleaning step is shown below.
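This is a minimal sketch of the cleaning described above, assuming df is the combined DataFrame from earlier; the exact regular expressions are illustrative.

import re

def clean_text(text: str) -> str:
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # remove URLs
    text = re.sub(r'[^A-Za-z0-9.,!?\'" ]+', ' ', text)  # drop emoji and other special characters
    return re.sub(r'\s+', ' ', text).strip()            # collapse whitespace

df = df[['text', 'label']].copy()                 # drop title, subject, date, etc.
df['text'] = df['text'].astype(str).apply(clean_text)
df = df[df['text'].str.len() > 0].reset_index(drop=True)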
Precise Setting
Here are the settings used for training (a sketch of the tokenization and dataset split follows the list):
- Tokenizer: BertTokenizer
- Epochs: 4 epochs for BERT model, 5 epochs for other models
- Dataset: Train data:Validation data:Test data = 8:1:1
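The training code below relies on a Dataset wrapper and the df_train / df_val / df_test splits, which are not shown in the original code. Here is a minimal sketch of what they might look like, assuming the cleaned DataFrame df from above and the bert-base-cased tokenizer to match the encoder used during training.

import numpy as np
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

class Dataset(torch.utils.data.Dataset):
    """Tokenizes each article and returns (encodings, label) pairs."""
    def __init__(self, df):
        self.labels = df['label'].values
        self.texts = [tokenizer(text, padding='max_length', truncation=True,
                                max_length=512, return_tensors='pt')
                      for text in df['text']]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

# Shuffle, then split 8:1:1 into train / validation / test
df_train, df_val, df_test = np.split(df.sample(frac=1, random_state=42),
                                     [int(0.8 * len(df)), int(0.9 * len(df))])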
Training
I did the training with the code below.
import torch
from torch import nn
from transformers import BertModel
from torch.optim import Adam
from tqdm import tqdm
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
class BertClassifier(nn.Module):
    def __init__(self, lr, withdropout, dropout=0.5):
        super(BertClassifier, self).__init__()
        # Pre-trained BERT encoder ('cased' variant this time)
        self.bert = BertModel.from_pretrained('bert-base-cased')
        self.lr = lr
        self.withdropout = withdropout
        self.dropout = nn.Dropout(dropout)
        # Single output logit; the sigmoid is applied outside the model
        # (inside BCEWithLogitsLoss for training, torch.sigmoid for predictions)
        self.linear = nn.Linear(768, 1)

    def forward(self, input_id, mask):
        mask = mask.squeeze(1)
        outputs = self.bert(input_ids=input_id, attention_mask=mask, return_dict=True)
        # Pooled [CLS] representation of the article
        pooled_output = outputs.pooler_output
        if self.withdropout == 'withDropout':
            dropout_output = self.dropout(pooled_output)
        elif self.withdropout == 'withoutDropout':
            dropout_output = pooled_output
        final_layer = self.linear(dropout_output)
        return final_layer
def train(model, train_data, val_data, epochs):
    # model_number is a global counter set in the loop at the bottom
    print(f'model{model_number}:')
    train, val = Dataset(train_data), Dataset(val_data)
    train_dataloader = torch.utils.data.DataLoader(train, batch_size=2, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    # Binary cross-entropy on raw logits (the sigmoid is applied inside the loss)
    criterion = nn.BCEWithLogitsLoss()
    optimizer = Adam(model.parameters(), lr=model.lr)
    if use_cuda:
        model = model.cuda()
        criterion = criterion.cuda()

    for epoch_num in range(epochs):
        # ---- training ----
        model.train()
        total_acc_train = 0
        total_loss_train = 0
        train_total_samples = 0
        for train_input, train_label in tqdm(train_dataloader):
            train_label = train_label.unsqueeze(1).to(device)
            mask = train_input['attention_mask'].to(device)
            input_id = train_input['input_ids'].squeeze(1).to(device)

            output = model(input_id, mask)
            batch_loss = criterion(output, train_label.float())
            total_loss_train += batch_loss.item()

            optimizer.zero_grad()
            batch_loss.backward()
            optimizer.step()

            # Threshold the sigmoid output at 0.5 to get the predicted class
            train_predictions = (torch.sigmoid(output) >= 0.5).float()
            total_acc_train += (train_predictions == train_label).sum().item()
            train_total_samples += train_label.size(0)

        # ---- validation ----
        model.eval()
        total_acc_val = 0
        total_loss_val = 0
        val_total_samples = 0
        with torch.no_grad():
            for val_input, val_label in val_dataloader:
                val_label = val_label.unsqueeze(1).to(device)
                mask = val_input['attention_mask'].to(device)
                input_id = val_input['input_ids'].squeeze(1).to(device)

                output = model(input_id, mask)
                batch_loss = criterion(output, val_label.float())
                total_loss_val += batch_loss.item()

                val_predictions = (torch.sigmoid(output) >= 0.5).float()
                total_acc_val += (val_predictions == val_label).sum().item()
                val_total_samples += val_label.size(0)

        print(
            f'Epochs: {epoch_num + 1} | Train Loss: {total_loss_train / len(train_data): .3f} \
            | Train Accuracy: {total_acc_train / train_total_samples: .3f} \
            | Val Loss: {total_loss_val / len(val_data): .3f} \
            | Val Accuracy: {total_acc_val / val_total_samples: .3f}')
def evaluate(model, test_data):
    test = Dataset(test_data)
    test_dataloader = torch.utils.data.DataLoader(test, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    if use_cuda:
        model = model.cuda()

    model.eval()
    all_labels = []
    all_predictions = []
    with torch.no_grad():
        for test_input, test_label in test_dataloader:
            test_label = test_label.to(device)
            mask = test_input['attention_mask'].to(device)
            input_id = test_input['input_ids'].squeeze(1).to(device)

            output = model(input_id, mask)
            predictions = (torch.sigmoid(output) >= 0.5).float()
            all_labels.extend(test_label.cpu().numpy().tolist())
            all_predictions.extend(predictions.cpu().numpy().tolist())

    # Predictions come back as single-element lists because of the extra output dimension
    all_labels = [label[0] if isinstance(label, list) else label for label in all_labels]
    all_predictions = [pred[0] if isinstance(pred, list) else pred for pred in all_predictions]

    accuracy = accuracy_score(all_labels, all_predictions)
    precision = precision_score(all_labels, all_predictions, zero_division=0)
    recall = recall_score(all_labels, all_predictions, zero_division=0)
    f1 = f1_score(all_labels, all_predictions)

    print(f"Accuracy: {accuracy:.6f}")
    print(f"Precision: {precision:.6f}")
    print(f"Recall: {recall:.6f}")
    print(f"F1 Score: {f1:.6f}")
    print("\nClassification Report:")
    print(classification_report(all_labels, all_predictions, zero_division=0))
# ---------------------------------
EPOCHS = 4
model1 = BertClassifier(1e-5, 'withDropout')
model2 = BertClassifier(5e-5, 'withDropout')
model3 = BertClassifier(1e-5, 'withoutDropout')
model4 = BertClassifier(5e-5, 'withoutDropout')
models = [model1, model2, model3, model4]  # train and evaluate all four BERT variants
model_number = 0
for model in models:
    model_number += 1
    train(model, df_train, df_val, EPOCHS)
    evaluate(model, df_test)
Result
Here are the results I got.
As I expected, the BERT-based model achieved higher accuracy than any of the other models. Model 1 reached 61% accuracy, the highest among all the models.
More surprisingly, the BERT models with a learning rate of 5e-5 got the worst accuracy of all. My guess is that the learning rate was so large that the update steps became too big, so the loss did not converge and the model could not fit properly. I would like to pursue this in the future if I have time (one idea, a warmup schedule, is sketched at the end of this section).
But it is also true that, when trained with the right hyperparameters like Model 1, BERT shows its real power.
BERT has the property that task-specific models can be created by fine-tuning after pre-training. I therefore felt that the performance gap widened compared with models trained from scratch, such as the CNNs and LSTMs; the BERT-based models were trained for one epoch fewer, yet they still performed well, which shows the superior language understanding gained from pre-training.
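As a follow-up to the learning-rate hypothesis above, a common trick when fine-tuning BERT is a warmup-then-linear-decay schedule. This was not part of my experiments; the snippet below is only a sketch of how it could be plugged into the training loop, using the scheduler utility from the transformers library and assuming the optimizer and train_dataloader from the training code.

from transformers import get_linear_schedule_with_warmup

num_training_steps = len(train_dataloader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # ramp the LR up over the first 10% of steps
    num_training_steps=num_training_steps)           # then decay it linearly to zero

# Inside the training loop, step the scheduler right after the optimizer:
#   optimizer.step()
#   scheduler.step()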
Conclusion
This time, I created a fake news detection model using BERT. BERT is a REALLY powerful model for handling large amounts of text data. With the proper settings, it can solve language tasks effectively and efficiently compared with many other models.
If I have the time, I would like to tackle the same task with more powerful models such as RoBERTa, GPT, and so on. I will keep that for next time.
THANK YOU FOR READING :)))))))