<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matthew</title>
    <description>The latest articles on DEV Community by Matthew (@mdani38).</description>
    <link>https://dev.to/mdani38</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F223374%2F0cfaa56b-19c6-431e-bf1a-266959557975.JPG</url>
      <title>DEV Community: Matthew</title>
      <link>https://dev.to/mdani38</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mdani38"/>
    <language>en</language>
    <item>
      <title>NLP Transfer Learning with BERT</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Tue, 05 Nov 2019 01:49:57 +0000</pubDate>
      <link>https://dev.to/mdani38/nlp-transfer-learning-with-bert-2ldf</link>
      <guid>https://dev.to/mdani38/nlp-transfer-learning-with-bert-2ldf</guid>
      <description>&lt;p&gt;Recently, I was working on a Natural Language Processing (NLP) project where the goal was to classify fake news based on the text contained in the headline and body text. I tried a few different methods including a simple baseline model. However, I picked this NLP project so that I could have a first exposure to working with neural networks since I have not worked with them much previously. Through this process, I discovered the power of using pre-trained BERT neural networks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Np42PBz_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/czq7cpiam0ab84ym1x46.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Np42PBz_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/czq7cpiam0ab84ym1x46.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;But first, some background.&lt;/p&gt;




&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Fake news is a growing problem, and the popularity of social media as a fast and easy way to share news has only exacerbated the issue. For this project, I was attempting to classify news as real or fake using the FakeNewsNet dataset created by the Data Mining and Machine Learning lab (DMML) at ASU. The dataset was built from news articles shared on Twitter over the past few years. The Real or Fake labels were generated by checking the stories against two fact-checking tools: PolitiFact (political news) and GossipCop (primarily entertainment news, but other articles as well). For more information about the dataset, see &lt;a href="https://arxiv.org/abs/1809.01286"&gt;https://arxiv.org/abs/1809.01286&lt;/a&gt; and &lt;a href="https://github.com/KaiDMML/FakeNewsNet"&gt;https://github.com/KaiDMML/FakeNewsNet&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Kj4VH0I_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/b9bc786uwlqp01tlex1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Kj4VH0I_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/b9bc786uwlqp01tlex1w.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Baseline
&lt;/h2&gt;

&lt;p&gt;For fake news classification on this dataset, I used 9,829 data points, training on 80% of the dataset and testing on the other 20%. I did not use any of the additional features in the data (author, source of the article, date published, etc.) so that I could focus only on the title and text of the articles, which I combined into a single text field for training. For NLP, you have to vectorize your text before feeding it into a model. My first model was a simple sklearn pipeline using count vectorization and TF-IDF. Count vectorization splits the text into word tokens and counts how many times each token occurs in each document. TF-IDF stands for term frequency times inverse document frequency, a weighting scheme that scales each token's count in a document by how rare that token is across all of the documents (samples of text) being analyzed. This lessens the impact of words that appear broadly in every document. These two steps create new numeric features for the text, which were then passed into a simple logistic regression model for classification, yielding an accuracy of 79%. However, this model only accounts for how often a word occurs in a document relative to the whole vocabulary when classifying real versus fake. &lt;/p&gt;
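&lt;p&gt;To make the baseline concrete, here is a minimal sketch of such a pipeline. The handful of toy headlines below are invented stand-ins for the FakeNewsNet data (1 = real, 0 = fake), and scikit-learn is assumed to be installed.&lt;/p&gt;

```python
# Minimal sketch of the baseline: count vectorization -> TF-IDF -> logistic
# regression. The toy headlines are invented stand-ins for FakeNewsNet.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = [
    "officials certify local election results after recount",
    "senate passes budget bill after long debate",
    "new vaccine approved following clinical trials",
    "city council approves funding for new library",
    "shocking miracle cure doctors do not want you to know",
    "celebrity secretly replaced by a clone says insider",
    "aliens endorse candidate in leaked video",
    "one weird trick erases all your debt overnight",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = real, 0 = fake

pipe = Pipeline([
    ("counts", CountVectorizer()),   # split into tokens and count occurrences
    ("tfidf", TfidfTransformer()),   # down-weight words common to all documents
    ("clf", LogisticRegression()),   # simple linear classifier on the features
])
pipe.fit(texts, labels)
accuracy = pipe.score(texts, labels)  # training accuracy on the toy set
```

&lt;p&gt;On the real dataset, you would fit on the 80% training split and score on the held-out 20% rather than on the training data itself.&lt;/p&gt;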

&lt;p&gt;A more sophisticated model can be achieved by neural networks and deep learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Neural Networks for NLP and Why BERT is so Important
&lt;/h2&gt;

&lt;p&gt;I decided to move on to a neural network, yet there are many different types and architectures of neural networks used in NLP. I found many instances online of RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) being used for these problems. Their architectures are set up to retain information across a sequence, and they can take in word embeddings that capture more than just individual words. So I set out to build an RNN, and I got most of the way to making one that worked well before I discovered transfer learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Work Smarter, Not Harder or How I Stopped Worrying and Learned to Love Transfer Learning
&lt;/h3&gt;

&lt;p&gt;Transfer learning is a concept in deep learning where you take knowledge gained from one problem and apply it to a similar problem. While this may seem purely conceptual, it is applied quite regularly in machine learning. Practically, it involves taking a pre-trained model that has already been trained on a large amount of data and then retraining only the last layer on domain-specific data for the related problem. This can be a powerful method when you don't have the massive amounts of data, training time, or computational power needed to train a neural network from scratch. It has been used regularly in image classification problems and, in the last few years, has begun to be applied in NLP. This is largely because of the development of BERT.&lt;/p&gt;
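&lt;p&gt;As a toy illustration of that recipe (not the author's actual model), the sketch below freezes a stand-in "pretrained" feature extractor and trains only a small classification head on new data. All shapes, data, and numbers here are invented for the example.&lt;/p&gt;

```python
# Toy transfer-learning recipe: the "pretrained" body is frozen, and only the
# final classification head is trained on the new task's data.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained network body: a fixed projection whose weights
# are never updated below (i.e., they are frozen).
W_frozen = rng.normal(size=(20, 5))

def extract_features(x):
    return np.tanh(x @ W_frozen)

# Invented task data: 200 samples; the label depends on the first input feature.
X = rng.normal(size=(200, 20))
y = (X[:, 0] > 0).astype(float)

# Train only the head: logistic regression fit by gradient descent.
w, b = np.zeros(5), 0.0
feats = extract_features(X)          # computed once; the body never changes
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    w -= 0.5 * feats.T @ (p - y) / len(y)
    b -= 0.5 * np.mean(p - y)

accuracy = np.mean((p > 0.5) == y)   # training accuracy of the head alone
```

&lt;p&gt;The point of the sketch is the division of labor: all of the expensive learning lives in the frozen body, while only a handful of head parameters are fit to the new task.&lt;/p&gt;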

&lt;h3&gt;
  
  
  Why BERT?
&lt;/h3&gt;

&lt;p&gt;BERT is a powerful model for transfer learning for several reasons. First, like OpenAI's GPT-2, it is based on the Transformer architecture (originally an encoder paired with a decoder). However, GPT-2 reads text in only one direction, which makes it less than ideal for classification. BERT reads text bidirectionally, so it can use the words both before and after a given word in a sequence. Another important advantage is that BERT is a masked language model: during pretraining, 15% of the input tokens are masked and the model learns to predict them. These two factors make it very good at a variety of text classification tasks. &lt;/p&gt;
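&lt;p&gt;Here is a simplified sketch of that masking step. (The real BERT procedure is a bit subtler: it also sometimes swaps in a random token or leaves a chosen token unchanged instead of masking it.)&lt;/p&gt;

```python
# Simplified illustration of BERT's masked-language-model pretraining: hide
# roughly 15% of the tokens and record the originals the model must predict.
import random

def mask_tokens(tokens, mask_rate=0.15, seed=42):
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]   # the original token to be predicted
        masked[i] = "[MASK]"
    return masked, targets

sentence = ("the quick brown fox jumps over the lazy dog "
            "near the old river bank today").split()
masked, targets = mask_tokens(sentence)
```

&lt;p&gt;Training on this "fill in the blank" objective is what forces the model to use context from both directions at once.&lt;/p&gt;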

&lt;p&gt;Those properties are the magic behind BERT, but its true power lies in transfer learning for NLP. BERT is pre-trained on a huge amount of text, including English Wikipedia. That is far more data than most people could train on for a specific problem. Therefore, you can import a pre-trained BERT and retrain just the final layer on context-specific data to create a powerful classification model in a short amount of time. Using a pre-trained BERT, I was able to achieve an accuracy of 71% without tuning many parameters.&lt;/p&gt;

&lt;p&gt;This suggests that if you are tackling an NLP classification problem, your first instinct should be to build on the work that is already out there before building something from scratch.&lt;/p&gt;

&lt;p&gt;If you want to take a look at the full presentation, please visit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.canva.com/design/DADp9gRmr78/ggbpzPMzuxSUXoPQeIvSDw/view?utm_content=DADp9gRmr78&amp;amp;utm_campaign=designshare&amp;amp;utm_medium=link&amp;amp;utm_source=sharebutton#1"&gt;https://www.canva.com/design/DADp9gRmr78/ggbpzPMzuxSUXoPQeIvSDw/view?utm_content=DADp9gRmr78&amp;amp;utm_campaign=designshare&amp;amp;utm_medium=link&amp;amp;utm_source=sharebutton#1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you wish to look at the code and work through these things yourself, please visit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/mdani38/Module-4-Project_houston-ds-082619"&gt;https://github.com/mdani38/Module-4-Project_houston-ds-082619&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Links and References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/KaiDMML/FakeNewsNet"&gt;https://github.com/KaiDMML/FakeNewsNet&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/1809.01286"&gt;https://arxiv.org/abs/1809.01286&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/1712.07709"&gt;https://arxiv.org/abs/1712.07709&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/1708.01967"&gt;https://arxiv.org/abs/1708.01967&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/1810.04805"&gt;https://arxiv.org/pdf/1810.04805&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf"&gt;https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf&lt;/a&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://giphy.com/gifs/reaction-creepy-MtmFbGJ6YsUEg"&gt;https://giphy.com/gifs/reaction-creepy-MtmFbGJ6YsUEg&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;Word cloud showing highest occurring words from the fake news and the real news in the dataset &lt;a href="https://www.canva.com/design/DADp9gRmr78/ggbpzPMzuxSUXoPQeIvSDw/view?utm_content=DADp9gRmr78&amp;amp;utm_campaign=designshare&amp;amp;utm_medium=link&amp;amp;utm_source=sharebutton#6"&gt;https://www.canva.com/design/DADp9gRmr78/ggbpzPMzuxSUXoPQeIvSDw/view?utm_content=DADp9gRmr78&amp;amp;utm_campaign=designshare&amp;amp;utm_medium=link&amp;amp;utm_source=sharebutton#6&lt;/a&gt;  ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Getting Good at Git</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Fri, 27 Sep 2019 15:32:48 +0000</pubDate>
      <link>https://dev.to/mdani38/getting-good-at-git-6ea</link>
      <guid>https://dev.to/mdani38/getting-good-at-git-6ea</guid>
      <description>&lt;p&gt;Last week I worked on a project in which I was tasked with using movie data from different sources in order to gather insights on what factors create value in the film industry. I had written some code on using some loops to make multiple requests to the The Movie Database API. The function gathered nearly 10000 movies from the discover endpoint then used the loop to use the unique movie ids to request from a different endpoint to gather detailed movie information and write it to a data frame.&lt;/p&gt;

&lt;h2&gt;
  
  
  So far, so good...
&lt;/h2&gt;

&lt;p&gt;It was my intention to write this blog post about that process. After the long process of finishing the requests and gathering all of the data into a data frame, I exported it to a CSV file that I sent to a teammate. However, an incident with Git and GitHub deleted all of the files in my project repository, which wiped out all the code related to the process. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MDJblIOy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/752rnb9uqbp1ncf4ttaj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MDJblIOy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/752rnb9uqbp1ncf4ttaj.gif" alt="Angry Fed Up GIF"&gt;&lt;/a&gt;&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;This brings me to the topic of this post, getting git to work for you and not against you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Github repositories are a powerful tool... but understand what you are doing
&lt;/h2&gt;

&lt;p&gt;What happened in my case is that I forked a repository but cloned the original repository, so I ended up pushing all of my changes to the wrong location. We were able to get the files off of the repository and change my upstream, but in that process all the files were wiped off of my local directory.&lt;/p&gt;

&lt;p&gt;Platforms that host repositories are a great way to share versions of code and store files off of your working computer. But if you do not use Git properly on GitHub or GitLab, it can only hamper your data science or coding project. So, I will share some quick tips that I have gathered from my own recent struggles.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gG0WFuzZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/76hdpvhvpewwppoifhv6.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gG0WFuzZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/76hdpvhvpewwppoifhv6.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 1: Fork it
&lt;/h3&gt;

&lt;p&gt;If there is a repository on GitHub that you are interested in using, the important first step is to fork it. This will create a copy of the repository linked to your account. It may not seem vital if you are just trying to access some code or some files but it will save you heartbreak in the long run. &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Clone the repository
&lt;/h3&gt;

&lt;p&gt;The next step is to clone the repository to your local machine using the git clone command in your terminal, but make sure you pay attention to the URL you are pasting. When you clone the repository into a local directory, Git sets up the connection to the upstream remote repository hosted on GitHub. Practically, you will navigate to the directory you want to use on your local machine and then run git clone with the copied link to your forked repository, not the original.&lt;/p&gt;
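&lt;p&gt;You can rehearse the clone-and-verify step entirely offline. In the hypothetical sketch below, a local bare repository stands in for your fork on GitHub, and the directory names are invented for the example.&lt;/p&gt;

```shell
# A local bare repository plays the role of YOUR fork on GitHub, so the
# whole example runs offline. Paths here are invented for illustration.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/fork.git"            # pretend: your fork on GitHub
git clone -q "$tmp/fork.git" "$tmp/work" 2>/dev/null
cd "$tmp/work"
git remote -v                                 # always check where origin points
```

&lt;p&gt;Running git remote -v right after cloning is the quickest way to catch the exact mistake described above: origin pointing at the original repository instead of your fork.&lt;/p&gt;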

&lt;h3&gt;
  
  
  Step 3: Work
&lt;/h3&gt;

&lt;p&gt;Great. So now we are all set up with a directory that is connected to a remote repository using Git. You can work with files in the local directory to your heart's content. Issues often arise in the next steps, when you try to upload files to the remote repository. &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: .gitignore
&lt;/h3&gt;

&lt;p&gt;Before you start committing changes and pushing files to the repository, you will want to set up your .gitignore. This is a file stored in the local directory that is connected to the remote using Git. You can add file paths to the body of this file, and Git will ignore those files when you move files to the remote location. It is recommended to put config files, API keys, and environment files in the .gitignore, but you can add any files that you do not want in your public repository.&lt;/p&gt;
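&lt;p&gt;For example, a minimal .gitignore for a data science project might look like the sketch below. The filenames are hypothetical; adjust them to whatever your own project actually contains.&lt;/p&gt;

```shell
# Write a minimal .gitignore into a scratch directory. The entries are
# hypothetical examples of files to keep out of a public repository.
set -e
tmp=$(mktemp -d)
cd "$tmp"
cat > .gitignore <<'EOF'
# secrets and machine-specific configuration
.env
api_keys.json
config.py
# bulky generated data
*.csv
EOF
cat .gitignore
```

&lt;p&gt;Glob patterns like *.csv are handy for keeping large generated data files out of the repository without listing each one by name.&lt;/p&gt;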

&lt;h3&gt;
  
  
  Step 5: Adding files to the staging area
&lt;/h3&gt;

&lt;p&gt;Moving files from your local repository to the remote location on GitHub is a multi-stage process that has to be done in order to be successful. The first step is to add the files to the staging area. The command is git add &lt;em&gt;filename&lt;/em&gt;. Additionally, you can use git add --all to add all files and subdirectories to the staging area. If your .gitignore is set up properly, Git will still ignore the specified files and add everything else. There are also ways to force Git to add an ignored file anyway, using git add &lt;em&gt;filename&lt;/em&gt; -f.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Commit
&lt;/h3&gt;

&lt;p&gt;Once you have added files using git add, the next step is to commit the changes. One thing that I like to do first (and at every stage of this process) is to run git status. This will tell you which files are in the staging area, as well as which files differ from those in the repository. &lt;/p&gt;

&lt;p&gt;Now you are ready to commit the changes to the files you have added to the local repository. Again, this will only affect the files currently in the staging area. A commit message is required, and it is a good habit to make it meaningful: once you push the files to your public repository, the message history will be visible. The command is git commit -m "Commit Message".&lt;/p&gt;

&lt;p&gt;After you have completed this step, you will have committed changes using Git. However, this only affects the local Git repository in that directory. There is another step to move committed files to the remote repository hosted publicly on GitHub, as well as to pull the files that others have put on it. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Amvrpbkc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/90s7gu9195ekqbz4s7l7.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Amvrpbkc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/90s7gu9195ekqbz4s7l7.gif" alt="Push It"&gt;&lt;/a&gt;&lt;br&gt;
&lt;sup id="fnref3"&gt;3&lt;/sup&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7: Push it (and pull)
&lt;/h3&gt;

&lt;p&gt;GitHub is a powerful platform because it allows multiple people to work on a public repository, adding and removing files. It also gives you different branches of the repository to work with. If you are working on a project with someone else, you will most likely have to add files to a shared repository and pull the files your partners have added.&lt;/p&gt;

&lt;p&gt;Once you have committed your changes, you are ready to push them to the remote upstream repository. However, it is best to pull the changes from the remote repository first; there will be conflicts if you try to push to a repository that has changed since your last pull. Using the command git pull origin &lt;strong&gt;branch&lt;/strong&gt;, you will pull any changes from the repository and add them to your local directory. &lt;/p&gt;

&lt;p&gt;Now you can push your committed changes up to the repository. This is done using the command git push origin &lt;strong&gt;branch&lt;/strong&gt;. By default these commands use the origin remote, but you can specify a different remote or branch. The final step is to check the public repository to confirm that the changes have registered.&lt;/p&gt;
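&lt;p&gt;The whole add/commit/pull/push cycle above can also be rehearsed offline. In this sketch a local bare repository stands in for the shared GitHub repository, and the identity and filenames are invented for the example.&lt;/p&gt;

```shell
# Rehearse the full cycle offline: a bare repo stands in for GitHub.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/remote.git"          # stand-in for the shared repo
git clone -q "$tmp/remote.git" "$tmp/work" 2>/dev/null
cd "$tmp/work"
git config user.email "matthew@example.com"   # placeholder identity
git config user.name "Matthew"
echo "movie results" > analysis.txt
git status --short                            # see what changed before staging
git add analysis.txt                          # stage the file
git commit -q -m "Add analysis results"       # commit with a meaningful message
git pull -q origin HEAD 2>/dev/null || true   # pull first (a no-op here)
git push -q origin HEAD                       # then push to the remote
```

&lt;p&gt;Pushing HEAD sends your current branch to a branch of the same name on the remote, which avoids guessing whether the default branch is called master or main.&lt;/p&gt;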




&lt;h2&gt;
  
  
  Quick tips
&lt;/h2&gt;

&lt;h5&gt;
  
  
  Fork it
&lt;/h5&gt;

&lt;h5&gt;
  
  
  Clone it into a new directory and know where that is
&lt;/h5&gt;

&lt;h5&gt;
  
  
  Set up your gitignore
&lt;/h5&gt;

&lt;h5&gt;
  
  
  Use git status often to see what has changed in your local repository at each stage of git add, git commit, git push, etc
&lt;/h5&gt;

&lt;h5&gt;
  
  
  Make meaningful commit messages to track your commit history
&lt;/h5&gt;

&lt;h5&gt;
  
  
  Pull before you push
&lt;/h5&gt;

&lt;h5&gt;
  
  
  Use different filenames than the partners who share your repository. This will prevent merge conflicts and the work of resolving them
&lt;/h5&gt;

&lt;h5&gt;
  
  
  Read the Git docs for more tips on making Git work in your favor &lt;a href="https://git-scm.com/docs"&gt;https://git-scm.com/docs&lt;/a&gt;
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iaXPifZr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/amj43r32pafjk3063f5p.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iaXPifZr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/amj43r32pafjk3063f5p.gif" alt="That's All"&gt;&lt;/a&gt;&lt;br&gt;
&lt;sup id="fnref4"&gt;4&lt;/sup&gt;&lt;/p&gt;




&lt;ol&gt;
&lt;li&gt;Banner image source: &lt;a href="https://techcrunch.com/2019/01/07/github-free-users-now-get-unlimited-private-repositories/"&gt;https://techcrunch.com/2019/01/07/github-free-users-now-get-unlimited-private-repositories/&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://giphy.com/gifs/LE4FWkEdR7kC4"&gt;https://giphy.com/gifs/LE4FWkEdR7kC4&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://tenor.com/view/brooklyn99-andy-samberg-jake-peralta-do-not-blow-this-for-us-gif-12004248"&gt;https://tenor.com/view/brooklyn99-andy-samberg-jake-peralta-do-not-blow-this-for-us-gif-12004248&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://giphy.com/gifs/pix-pepa-Maf9DN4Ftb0BO"&gt;https://giphy.com/gifs/pix-pepa-Maf9DN4Ftb0BO&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://giphy.com/gifs/PixelBandits-pixel-forest-YBJHgYmcNFRODvssAL"&gt;https://giphy.com/gifs/PixelBandits-pixel-forest-YBJHgYmcNFRODvssAL&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>beginners</category>
      <category>firstyearincode</category>
      <category>github</category>
      <category>git</category>
    </item>
    <item>
      <title>Data, Inductive Reasoning and Procedural Crime Dramas: Why I Decided to Pursue Data Science</title>
      <dc:creator>Matthew</dc:creator>
      <pubDate>Thu, 05 Sep 2019 23:59:37 +0000</pubDate>
      <link>https://dev.to/mdani38/data-inductive-reasoning-and-procedural-crime-dramas-why-i-decided-to-pursue-data-science-53i6</link>
      <guid>https://dev.to/mdani38/data-inductive-reasoning-and-procedural-crime-dramas-why-i-decided-to-pursue-data-science-53i6</guid>
      <description>&lt;p&gt;In the past month, I made the choice to pursue a Data Science bootcamp. It may seem that this decision has come out of nowhere, but there are many reasons that have driven me to enter into the exciting world of Data Science. &lt;/p&gt;

&lt;p&gt;While the growing career opportunities for data scientists are appealing, using data and programming to solve problems has always interested me. In the past few years, I have dabbled on and off with programming and different techniques to manipulate the data I used in my pursuits as a scientist. This pivot represents a change in my possible career path, but it stays true to the interests I have always had.&lt;/p&gt;




&lt;h2&gt;
  
  
  It All Started With Science
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5sef8wpzoyz9vc88vv1u.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5sef8wpzoyz9vc88vv1u.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;I have been interested in the pursuits of science for most of my life. From a young age, I have been fascinated by the tools of science and how we can use them to understand more about the world around us. A scientific approach drives how I interact with all of the information around me. &lt;/p&gt;

&lt;p&gt;My interest in science drove me to get a BS and MS in Geology, as well as a publication based on my research on seismic reconstructions of topography beneath ice sheet deposits in the Ross Sea of Antarctica. These past pursuits have already given me plenty of experience analyzing, manipulating, and visualizing large datasets. &lt;/p&gt;

&lt;p&gt;So why did I decide to go into Data Science in particular? For me, it is the concepts of &lt;strong&gt;inductive reasoning&lt;/strong&gt;, &lt;strong&gt;data-based problem solving&lt;/strong&gt;, and &lt;strong&gt;effective visualization of results&lt;/strong&gt;. These concepts are also shown in another thing I love, detective shows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detective Shows? What does that have to do with anything?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fyxm2xwf5eyd3iwfxabik.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fyxm2xwf5eyd3iwfxabik.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Yes, detective shows, or procedural crime dramas, are related to my interest in pursuing Data Science. While these types of TV shows are entertaining and I have seen most of them (CSI is my favorite), my love of detective shows is tied to my use of logical reasoning. &lt;/p&gt;

&lt;p&gt;While every episode is different, the procedure is the same. A character (such as Jessica Fletcher, pictured above) is presented with a mystery or problem. They must then examine the evidence at the scene of the crime and gather more information about the circumstances of the situation. They need to determine which evidence is useful and which is not as helpful. Then they draw conclusions based on the information they have.&lt;/p&gt;

&lt;p&gt;For me, this is a similar workflow to Data Science. The problem, goals, and challenges are always different, but a data scientist must use the data they have. Then they determine whether there is new data they can gather to help with the problem. They must decide what information in a large dataset is useful to focus on and what is not as helpful. Finally, they can analyze the data and use different techniques to model and visualize it to draw conclusions.&lt;/p&gt;




&lt;h2&gt;
  
  
  A New Path
&lt;/h2&gt;

&lt;p&gt;These are just my first thoughts about the field of Data Science and the use of large data sets to solve problems. Many of these concepts are new to me, and I have certainly been working hard at them. However, I am enjoying the learning process, and I am excited to begin my journey down this path and see where it leads me. &lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://giphy.com/gifs/tired-OnJLRvXvAmvPW" rel="noopener noreferrer"&gt;https://giphy.com/gifs/tired-OnJLRvXvAmvPW&lt;/a&gt; Source: &lt;a href="http://www.utica.edu/student-blogs/feeling-tired-try-improving-the-quality-of-your-sleep/" rel="noopener noreferrer"&gt;http://www.utica.edu/student-blogs/feeling-tired-try-improving-the-quality-of-your-sleep/&lt;/a&gt;    ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://giphy.com/gifs/angela-lansbury-jessica-fletcher-murder-she-wrote-JGLYZcomPzv4Q" rel="noopener noreferrer"&gt;https://giphy.com/gifs/angela-lansbury-jessica-fletcher-murder-she-wrote-JGLYZcomPzv4Q&lt;/a&gt;. Source: &lt;a href="https://cabotcovecorpse.tumblr.com/post/119018723434" rel="noopener noreferrer"&gt;https://cabotcovecorpse.tumblr.com/post/119018723434&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>beginners</category>
      <category>career</category>
      <category>firstyearincode</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
