To Begin With
Let me share what I made in a CTU class. Generative AI has been growing rapidly, and tools such as ChatGPT and Copilot have enriched our digital lives. However, alongside the expectation that life will become more convenient, there is also the risk of misuse by criminals and other bad actors.
Fake news is one of the most concerning uses of AI. Fake news is a powerful tool of misinformation that can distort public perception, incite fear, and undermine trust in legitimate sources, posing a significant threat to democracy and societal harmony.
With this background, I decided to build a model that can detect whether an article is real or fake.
Dataset
To train the model, I used two different datasets, listed below.
https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset/data
https://www.kaggle.com/datasets/emineyetm/fake-news-detection-datasets
The reason for mixing two datasets is simple: to increase the amount of training data and to help prevent overfitting. (A sketch of how the two could be combined follows the counts below.)
In total:
- 117,053 articles
- 56,445 real news
- 60,608 fake news
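Here is a minimal sketch of how the two datasets could be loaded and merged with pandas. The file names and the label encoding (fake = 0, real = 1) are assumptions; adjust them to the actual CSVs downloaded from Kaggle.

import pandas as pd

# First dataset: ships as separate CSVs for fake and real articles (file names assumed)
fake_1 = pd.read_csv('Fake.csv')
true_1 = pd.read_csv('True.csv')
fake_1['label'] = 0   # fake = 0 (assumed encoding)
true_1['label'] = 1   # real = 1

# Second dataset: same Fake/True style CSVs (file names assumed)
fake_2 = pd.read_csv('fake2.csv')
true_2 = pd.read_csv('true2.csv')
fake_2['label'] = 0
true_2['label'] = 1

# Merge everything into one DataFrame
df = pd.concat([fake_1, true_1, fake_2, true_2], ignore_index=True)
print(len(df), df['label'].value_counts())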
Model Overview
This time, I made the model with BERT. BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art pre-trained language model developed by Google. BERT's bidirectional approach, pre-training on large datasets, and ability to be fine-tuned for specific tasks make it one of the most powerful and widely used models in NLP.
The fine-tuning flow attaches a neural-network classifier head on top of the pre-trained BERT model. Since this is a Real-or-Fake binary classification task, the classifier head outputs two classes: Real and Fake.
Model Definition
from torch import nn
from transformers import BertModel

class BertForSequenceClassification(nn.Module):
    def __init__(self, activation_function, withdropout, hidden_size=768, num_labels=2, dropout_rate=0.1):
        super().__init__()
        # Pre-trained BERT encoder
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # Activation applied to the [CLS] embedding before the classifier head
        if activation_function == 'relu':
            self.activation_function = nn.ReLU()
        elif activation_function == 'sigmoid':
            self.activation_function = nn.Sigmoid()
        # Dropout is only active when 'withDropout' is requested
        if withdropout == 'withDropout':
            self.dropout = nn.Dropout(dropout_rate)
        else:
            self.dropout = nn.Dropout(0)
        # Linear classifier head: hidden_size -> num_labels (Real / Fake)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # The [CLS] token embedding summarizes the whole article
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        x = self.dropout(self.activation_function(cls_embedding))
        logits = self.classifier(x)
        return logits
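A quick way to sanity-check this class is to tokenize one sentence and run it through the model. The example sentence below is just an illustration.

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification(activation_function='relu', withdropout='withDropout')
model.eval()

encoded = tokenizer("Scientists confirm water was found on the moon.",
                    padding='max_length', truncation=True, max_length=128,
                    return_tensors='pt')
with torch.no_grad():
    logits = model(encoded['input_ids'], encoded['attention_mask'])
print(logits.shape)  # torch.Size([1, 2]) -> one score each for Real and Fake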
Comparison Model
To see how the BERT model compares with other architectures, I created several other models as well. Here is the list of models I created (a sketch of one of the non-BERT baselines follows the list).
Model 1 : BERT Model, learning rate = 1e-5, Dropout=0.5
Model 2 : BERT Model, learning rate = 1e-5, Dropout=0
Model 3 : BERT Model, learning rate = 5e-5, Dropout=0.5
Model 4 : BERT Model, learning rate = 5e-5, Dropout=0
Model 5 : CNN+LSTM
Model 6 : CNN
Model 7 : LSTM
Model 8 : GRU
Model 9 : CNN + GRU
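The code for the non-BERT baselines is not shown in this post. As a rough idea, here is a minimal sketch of what the LSTM baseline (Model 7) might look like, assuming the same token IDs and the same single-logit binary output as the BERT models; the layer sizes are illustrative.

import torch
from torch import nn

class LSTMClassifier(nn.Module):
    """Simple LSTM baseline: embed token IDs, run an LSTM, classify the final hidden state."""
    def __init__(self, vocab_size=30522, embed_dim=128, hidden_dim=128, dropout=0.5):
        super().__init__()
        # vocab_size defaults to the bert-base-uncased vocabulary; adjust to the tokenizer used
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden_dim, 1)  # one logit, like the BERT models

    def forward(self, input_ids, attention_mask=None):
        # attention_mask is accepted only to keep the same interface; it is unused here
        embedded = self.embedding(input_ids)          # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)          # hidden: (1, batch, hidden_dim)
        return self.linear(self.dropout(hidden[-1]))  # (batch, 1)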
Data preparation
I eliminated irrelevant features such as the article title, subject, and date, leaving only the text and the label (real or fake) in the data file. It is also known that special characters and tokens such as emoji and URLs can hurt the accuracy of a language model, so I removed all of those from the dataset as well. A rough sketch of this cleaning step is shown below.
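This is a minimal sketch of the cleaning described above, assuming df is the combined DataFrame from earlier; the exact regular expressions are illustrative.

import re

def clean_text(text: str) -> str:
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # remove URLs
    text = re.sub(r'[^A-Za-z0-9.,!?\'" ]+', ' ', text)  # drop emoji and other special characters
    return re.sub(r'\s+', ' ', text).strip()            # collapse whitespace

df = df[['text', 'label']].copy()                 # drop title, subject, date, etc.
df['text'] = df['text'].astype(str).apply(clean_text)
df = df[df['text'].str.len() > 0].reset_index(drop=True)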
Precise Setting
Here are the settings used for training (a sketch of the tokenization and dataset split follows the list):
- Tokenizer: BertTokenizer
- Epochs: 4 epochs for BERT model, 5 epochs for other models
- Dataset: Train data:Validation data:Test data = 8:1:1
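The training code below relies on a Dataset wrapper and the df_train / df_val / df_test splits, which are not shown in the original code. Here is a minimal sketch of what they might look like, assuming the cleaned DataFrame df from above and the bert-base-cased tokenizer to match the encoder used during training.

import numpy as np
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

class Dataset(torch.utils.data.Dataset):
    """Tokenizes each article and returns (encodings, label) pairs."""
    def __init__(self, df):
        self.labels = df['label'].values
        self.texts = [tokenizer(text, padding='max_length', truncation=True,
                                max_length=512, return_tensors='pt')
                      for text in df['text']]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

# Shuffle, then split 8:1:1 into train / validation / test
df_train, df_val, df_test = np.split(df.sample(frac=1, random_state=42),
                                     [int(0.8 * len(df)), int(0.9 * len(df))])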
Training
I did the training with the code below.
import torch
from torch import nn
from transformers import BertModel
from torch.optim import Adam
from tqdm import tqdm
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
class BertClassifier(nn.Module):
    def __init__(self, lr, withdropout, dropout=0.5):
        super(BertClassifier, self).__init__()
        # Pre-trained BERT encoder ('cased' variant this time)
        self.bert = BertModel.from_pretrained('bert-base-cased')
        self.lr = lr
        self.withdropout = withdropout
        self.dropout = nn.Dropout(dropout)
        # Single output logit; the sigmoid is applied outside the model
        # (inside BCEWithLogitsLoss for training, torch.sigmoid for predictions)
        self.linear = nn.Linear(768, 1)

    def forward(self, input_id, mask):
        mask = mask.squeeze(1)
        outputs = self.bert(input_ids=input_id, attention_mask=mask, return_dict=True)
        # Pooled [CLS] representation of the article
        pooled_output = outputs.pooler_output
        if self.withdropout == 'withDropout':
            dropout_output = self.dropout(pooled_output)
        elif self.withdropout == 'withoutDropout':
            dropout_output = pooled_output
        final_layer = self.linear(dropout_output)
        return final_layer
def train(model, train_data, val_data, epochs):
    # model_number is a global counter set in the loop at the bottom
    print(f'model{model_number}:')
    train, val = Dataset(train_data), Dataset(val_data)
    train_dataloader = torch.utils.data.DataLoader(train, batch_size=2, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    # Binary cross-entropy on raw logits (the sigmoid is applied inside the loss)
    criterion = nn.BCEWithLogitsLoss()
    optimizer = Adam(model.parameters(), lr=model.lr)
    if use_cuda:
        model = model.cuda()
        criterion = criterion.cuda()

    for epoch_num in range(epochs):
        # ---- training ----
        model.train()
        total_acc_train = 0
        total_loss_train = 0
        train_total_samples = 0
        for train_input, train_label in tqdm(train_dataloader):
            train_label = train_label.unsqueeze(1).to(device)
            mask = train_input['attention_mask'].to(device)
            input_id = train_input['input_ids'].squeeze(1).to(device)

            output = model(input_id, mask)
            batch_loss = criterion(output, train_label.float())
            total_loss_train += batch_loss.item()

            optimizer.zero_grad()
            batch_loss.backward()
            optimizer.step()

            # Threshold the sigmoid output at 0.5 to get the predicted class
            train_predictions = (torch.sigmoid(output) >= 0.5).float()
            total_acc_train += (train_predictions == train_label).sum().item()
            train_total_samples += train_label.size(0)

        # ---- validation ----
        model.eval()
        total_acc_val = 0
        total_loss_val = 0
        val_total_samples = 0
        with torch.no_grad():
            for val_input, val_label in val_dataloader:
                val_label = val_label.unsqueeze(1).to(device)
                mask = val_input['attention_mask'].to(device)
                input_id = val_input['input_ids'].squeeze(1).to(device)

                output = model(input_id, mask)
                batch_loss = criterion(output, val_label.float())
                total_loss_val += batch_loss.item()

                val_predictions = (torch.sigmoid(output) >= 0.5).float()
                total_acc_val += (val_predictions == val_label).sum().item()
                val_total_samples += val_label.size(0)

        print(
            f'Epochs: {epoch_num + 1} | Train Loss: {total_loss_train / len(train_data): .3f} \
            | Train Accuracy: {total_acc_train / train_total_samples: .3f} \
            | Val Loss: {total_loss_val / len(val_data): .3f} \
            | Val Accuracy: {total_acc_val / val_total_samples: .3f}')
def evaluate(model, test_data):
    test = Dataset(test_data)
    test_dataloader = torch.utils.data.DataLoader(test, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    if use_cuda:
        model = model.cuda()

    model.eval()
    all_labels = []
    all_predictions = []
    with torch.no_grad():
        for test_input, test_label in test_dataloader:
            test_label = test_label.to(device)
            mask = test_input['attention_mask'].to(device)
            input_id = test_input['input_ids'].squeeze(1).to(device)

            output = model(input_id, mask)
            predictions = (torch.sigmoid(output) >= 0.5).float()
            all_labels.extend(test_label.cpu().numpy().tolist())
            all_predictions.extend(predictions.cpu().numpy().tolist())

    # Predictions come back as single-element lists because of the extra output dimension
    all_labels = [label[0] if isinstance(label, list) else label for label in all_labels]
    all_predictions = [pred[0] if isinstance(pred, list) else pred for pred in all_predictions]

    accuracy = accuracy_score(all_labels, all_predictions)
    precision = precision_score(all_labels, all_predictions, zero_division=0)
    recall = recall_score(all_labels, all_predictions, zero_division=0)
    f1 = f1_score(all_labels, all_predictions)

    print(f"Accuracy: {accuracy:.6f}")
    print(f"Precision: {precision:.6f}")
    print(f"Recall: {recall:.6f}")
    print(f"F1 Score: {f1:.6f}")
    print("\nClassification Report:")
    print(classification_report(all_labels, all_predictions, zero_division=0))
# ---------------------------------
EPOCHS = 4
model1 = BertClassifier(1e-5, 'withDropout')
model2 = BertClassifier(5e-5, 'withDropout')
model3 = BertClassifier(1e-5, 'withoutDropout')
model4 = BertClassifier(5e-5, 'withoutDropout')
models = [model1, model2, model3, model4]  # train and evaluate all four BERT variants
model_number = 0
for model in models:
    model_number += 1
    train(model, df_train, df_val, EPOCHS)
    evaluate(model, df_test)
Result
Here are the results I got.
As I expected, the BERT-based model achieved higher accuracy than any of the other models. Model 1 reached 61% accuracy, the highest among all the models.
More surprisingly, the BERT models with a learning rate of 5e-5 got the worst accuracy of all. My guess is that the learning rate was so large that the update steps became too big, so the loss did not converge and the model could not fit properly. I would like to pursue this in the future if I have time (one idea, a warmup schedule, is sketched at the end of this section).
But it is also true that, when trained with the right hyperparameters like Model 1, BERT shows its real power.
BERT has the property that task-specific models can be created by fine-tuning after pre-training. I therefore felt that the performance gap widened compared with models trained from scratch, such as the CNNs and LSTMs; the BERT-based models were trained for one epoch fewer, yet they still performed well, which shows the superior language understanding gained from pre-training.
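As a follow-up to the learning-rate hypothesis above, a common trick when fine-tuning BERT is a warmup-then-linear-decay schedule. This was not part of my experiments; the snippet below is only a sketch of how it could be plugged into the training loop, using the scheduler utility from the transformers library and assuming the optimizer and train_dataloader from the training code.

from transformers import get_linear_schedule_with_warmup

num_training_steps = len(train_dataloader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # ramp the LR up over the first 10% of steps
    num_training_steps=num_training_steps)           # then decay it linearly to zero

# Inside the training loop, step the scheduler right after the optimizer:
#   optimizer.step()
#   scheduler.step()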
Conclusion
This time, I created a fake news detection model using BERT. BERT is a REALLY powerful model for handling large amounts of text data. With the proper settings, it can solve language tasks effectively and efficiently compared with many other models.
If I have the time, I would like to tackle the same task with more powerful models such as RoBERTa, GPT, and so on. I will keep that for next time.
THANK YOU FOR READING :)))))))