
NLP Transfer Learning with BERT

Recently, I was working on a Natural Language Processing (NLP) project where the goal was to classify fake news based on the text contained in the headline and body text. I tried a few different methods including a simple baseline model. However, I picked this NLP project so that I could have a first exposure to working with neural networks since I have not worked with them much previously. Through this process, I discovered the power of using pre-trained BERT neural networks.


But first, some background.


Fake news is a growing problem, and the popularity of social media as a fast and easy way to share news has only exacerbated the issue. For this project, I was attempting to classify news as real or fake using the FakeNewsNet database created by The Data Mining and Machine Learning lab (DMML) at ASU. The database was built from news articles shared on Twitter over the past few years. The labels of Real or Fake were generated by checking the news stories against two fact-checking tools: PolitiFact (political news) and GossipCop (primarily entertainment news, but other articles as well). If you want more information about the dataset, please see the FakeNewsNet paper and repository.



For fake news classification on this dataset, I used 9,829 data points, training on 80% of the dataset and testing on the other 20%. I did not use any of the additional features in the data (author, source of the article, date published, etc.) so that I could focus only on the title and text of the articles; I combined these two text fields to train my model. For NLP, you have to vectorize your text before you can feed it into a model. My first model was a simple sklearn pipeline using count vectorization and TF-IDF. Count vectorization splits the text into word tokens and counts how many times each token occurs in a document. TF-IDF stands for term frequency times inverse document frequency, a weighting scheme that scales those counts down for words that appear broadly across all of the documents (samples of text) being analyzed, so that very common words carry less weight. Together, these two steps create numeric features for the text. The features were then passed into a simple logistic regression model for classification, which yielded an accuracy of 79%. However, this model only accounts for how often a word occurs in a document relative to the whole vocabulary when classifying real from fake.
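The baseline above can be sketched as a short sklearn pipeline. The toy headlines and labels below are illustrative stand-ins, not the actual FakeNewsNet fields:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the combined title + body text fields.
texts = [
    "shocking celebrity scandal you won't believe",
    "senate passes budget bill after long debate",
    "miracle cure doctors don't want you to know",
    "local council approves new school funding",
]
labels = [1, 0, 1, 0]  # 1 = fake, 0 = real (illustrative labels)

pipe = Pipeline([
    ("counts", CountVectorizer()),   # tokenize and count word occurrences
    ("tfidf", TfidfTransformer()),   # down-weight words common to all documents
    ("clf", LogisticRegression()),   # linear classifier on the weighted counts
])
pipe.fit(texts, labels)
preds = pipe.predict(texts)
print(preds)
```

On the real dataset you would fit on the 80% training split and score on the held-out 20%.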

A more sophisticated model can be built with neural networks and deep learning.

Neural Networks for NLP and Why BERT is so Important

I decided to move on to a neural network, yet there are many different types and architectures of neural networks for use in NLP. I found many instances online of RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) being used for these problems. Their architectures are set up to retain information across a sequence, and they can take in word embeddings that capture more than just individual words. So I set out to build an RNN, and got most of the way to making one that worked well, when I discovered transfer learning.

Work Smarter, Not Harder or How I Stopped Worrying and Learned to Love Transfer Learning

Transfer learning is a concept in deep learning where you take knowledge gained from one problem and apply it to a similar problem. While this may seem purely conceptual, it is actually applied quite regularly in the field of machine learning. Practically, it involves taking a pre-trained model that has already been trained on a large amount of data and then retraining the last layer on domain-specific data for the related problem. This can be a powerful method when you don't have the massive amounts of data, training time, or computational power to train a neural network from scratch. It has been used fairly regularly in image classification problems, and in the last few years it has begun to be applied in NLP. This is largely because of the development of BERT.
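The "freeze the pre-trained layers, retrain the last one" idea can be sketched in a few lines of PyTorch. The tiny `base` network here is a hypothetical stand-in for a large pre-trained model:

```python
import torch
import torch.nn as nn

# Hypothetical tiny "pretrained" base; in practice this would be a large
# model already trained on a huge corpus.
base = nn.Sequential(nn.Linear(8, 4), nn.ReLU())
head = nn.Linear(4, 2)  # new final layer for the domain-specific task

for p in base.parameters():
    p.requires_grad = False  # freeze the pre-trained layers

# Only the new head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-2)

x = torch.randn(16, 8)               # a batch of feature vectors
y = torch.randint(0, 2, (16,))       # binary labels

logits = head(base(x))
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
optimizer.step()  # updates the head; the frozen base is untouched
```

Because the frozen parameters never receive gradients, only the small new layer has to be learned from your own data.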


BERT is a powerful model for transfer learning for several reasons. First, like OpenAI's GPT-2, it is based on the Transformer architecture; but where GPT-2 uses the Transformer's decoder stack and can only read words uni-directionally (left to right), which does not make it ideal for classification, BERT uses the encoder stack and reads words in both directions (bidirectionally), so it can use the context both before and after a word in a sequence. Another important advantage of BERT is that it is pre-trained as a masked language model: 15% of the tokens fed into the model are masked, and the model learns to predict them from the surrounding context. These two factors make it very good at a variety of text classification tasks.

Those details are important to the magic behind BERT, but its true power lies in its use for NLP transfer learning. BERT is pre-trained on a very large corpus of text, including all of English Wikipedia. That is far more data than most people could train on for a specific problem. Therefore, you can import a pre-trained BERT and retrain just the final layer on context-specific data to create a powerful classification neural network in a short amount of time. Using a pre-trained BERT, I was able to achieve an accuracy of 71% without tuning many parameters.

This suggests that if you are working on an NLP classification problem, your first instinct should be to utilize the work that is already out there before you build something from scratch.

If you want to take a look at the full presentation, please visit:

If you wish to look at the code and work through these things yourself, please visit:

Links and References


  2. Word cloud showing highest occurring words from the fake news and the real news in the dataset  
