DEV Community

Cover image for Anatomy of Language Models In NLP

Posted on

Anatomy of Language Models In NLP

I’m astonished and astounded by the vast array of tasks that can be performed with NLP – text summarization, generating completely new pieces of text, predicting what word comes next (Google’s autofill), among others. Do you know what is common among all these NLP tasks?

They are all powered by language models! Honestly, these language models are a crucial first step for most of the advanced NLP tasks.

Language models are an important component in the Natural Language Processing (NLP) journey. These language models power all the popular NLP applications we are familiar with like Google Assistant, Siri, Amazon’s Alexa, etc.

Language modeling is used in speech recognition, machine translation, part-of-speech tagging, parsing, Optical Character Recognition, handwriting recognition, information retrieval and other applications.

In this article, we will cover the length and breadth of language models. We will begin from basic language models that are basically statistical or probabilistic models and move to the State-of-the-Art language models that are trained using humongous data and are being currently used by the likes of Google, Amazon, and Facebook, among others. Some of which are mentioned in the blog given below written by me.

So, tighten your seatbelts and brush up your linguistic skills – we are heading into the wonderful world of Natural Language Processing!

Alt Text

Types of Language Models

There are basically two types of Language Models:

Statistical Language Models: These models use traditional statistical techniques like N-grams, Hidden Markov Models (HMM) and certain linguistic rules to learn the probability distribution of words.

Statistical Language Modeling, or Language Modeling and LM for short, is the development of probabilistic models that are able to predict the next word in the sequence given the words that precede it.

Neural Language Models: These are new players in the NLP town and have surpassed the statistical language models in their effectiveness. They use different kinds of Neural Networks to model language.

Recently, the use of neural networks in the development of language models has become very popular, to the point that it may now be the preferred approach. Neural network approaches are achieving better results than classical methods both on standalone language models and when models are incorporated into larger models on challenging tasks like speech recognition and machine translation.

GPT-3 which is making a lot of buzz now-a-days is an example of Neural language model. BERT by Google is another popular Neural language model used in the algorithm of the search engine for next word prediction of our search query.

Introduction to Statistical language models

Alt Text

So, we have discussed what are statistical language models. Now let's take a deep dive in the concept of Statistical language models. A language model learns the probability of word occurrence based on examples of text. Simpler models may look at a context of a short sequence of words, whereas larger models may work at the level of sentences or paragraphs. Most commonly, language models operate at the level of words.

N-gram Language Models

Let’s understand N-gram with an example. Consider the following sentence:

“I love reading blogs on DEV and develop new products”

A 1-gram (or unigram) is a one-word sequence. For the above sentence, the unigrams would simply be: "I", "love", "reading", "blogs", "on", "DEV", "and", "develop", "new", "products".

A 2-gram (or bigram) is a two-word sequence of words, like "I love", "love reading", "on DEV"or "new products". And a 3-gram (or trigram) is a three-word sequence of words like "I love reading", "blogs on DEV" or "develop new products".

An N-gram language model predicts the probability of a given N-gram within any sequence of words in the language. If we have a good N-gram model, we can predict p(w | h) – what is the probability of seeing the word w given a history of previous words h – where the history contains n-1 words.

example - I love reading ___ , here we want to predict what is the word which will fill the dash based on the probabilities of the previous words.

We must estimate this probability to construct an N-gram model.

We compute this probability in two steps:

1) Apply the chain rule of probability

2) We then apply a very strong simplification assumption to allow us to compute p(w1…ws) in an easy manner.

The chain rule of probability is:

p( = p(w1) . p(w2 | w1) . p(w3 | w1 w2) . p(w4 | w1 w2 w3) ..... p(wn | w1...wn-1)

So what is the chain rule? It tells us how to compute the joint probability of a sequence by using the conditional probability of a word given previous words.

But we do not have access to these conditional probabilities with complex conditions of up to n-1 words. So how do we proceed?

This is where we introduce a simplification assumption. We can assume for all conditions, that:

p(wk | w1...wk-1) = p(wk | wk-1)

Here, we approximate the history (the context) of the word wk by looking only at the last word of the context. This assumption is called the Markov assumption. It is an example of Bigram model. The same concept can be enhanced further for example for trigram model the formula will be

p(wk | w1...wk-1) = p(wk |wk-2 wk-1)

These models have a basic problem that they give the probability to zero if an unknown word is seen so the concept of smoothing is used. In smoothing we assign some probability to the unseen words. There are different types of smoothing techniques like - Laplace smoothing, Good Turing and Kneser-ney smoothing. The other problem is that they are very compute intensive for large histories and due to markov assumption there is some loss.

Introduction to Neural language models

Alt Text

Neural language models have some advantages over probabilistic models like they don’t need smoothing, they can handle much longer histories, and they can generalize over contexts of similar words. For a training set of a given size, a neural language model has much higher predictive accuracy than an n-gram language model.

On the other hand, there is a cost for this improved performance: neural net language models are strikingly slower to train than traditional language models,and so for many tasks an n-gram language model is still the right tool.

In neural language models, the prior context is represented by embeddings of the previous words. This allows neural language models to generalize to unseen data much better than n-gram language models.

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. Word embeddings are in fact a class of techniques where individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector and the vector values are learned in a way that resembles a neural network.Each word is represented by a real-valued vector, often tens or hundreds of dimensions.

Some of the word embedding techniques are Word2Vec and GloVe. To know more about Word2Vec read this super illustrative blog. GloVe is an extended version of Word2Vec.

The Neural language models were first based on RNNs and word embeddings. Then the concept of LSTMs, GRUs and Encoder-Decoder came along. The recent advancement is the discovery of Transformers which has changed the field of Language Modelling drastically. Some of the most famous language models like BERT, ERNIE, GPT-2 and GPT-3, RoBERTa are based on Transformers.

You can learn about the abbreviations from the given below blog

The RNNs were then stacked and used with bidirection but they were unable to capture long term dependencies. LSTMs and GRUs were introduced to counter this drawback.

The transformers form the basic building blocks of the new neural language models. The concept of transfer learning is introduced which was a major breakthrough. The models were pretrained using large datasets like BERT is trained on entire English Wikipedia. Unsupervised learning was used for training of the models. GPT-2 is trained on a set of 8 million webpages. These models are then fine-tuned to perform different NLP tasks. Discussing about the in detail architecture of different neural language models will be done in further posts.

Hope you enjoyed the article and got a good insight into the world of language models.

Discussion (2)

waylonwalker profile image
Waylon Walker

I have used tokenization and lemmatization in the past. Where do they fall into the nlp techniques you mention?

amananandrai profile image
amananandrai Author

Lemmatization and tokenization are used in the case of text classification and sentiment analysis as far as I know. In case of Neural language models use word embeddings which find relation between various words and store them in vectors. Like it can find that king and queen have the same relation as boy and girl and which words are similar in meaning and which are far away in context. In case of statistical models we can use tokenization to find the different tokens. Neural models have there own tokenizers and based on these tokens only the next token is generated during the test phase and tokenization is done during the training phase. Lemmatization will cause a little bit of error here as it trims the words to base form thus resulting in a bit of error. Eg- the base form of is, are and am is be thus a sentence like " I be Aman" would be grammatically incorrect and this will occur due to lemmatization