Language Model Implementation (Bigram Model)

#machinelearning #nlp #python #tutorial

Language models are one of the most important parts of Natural Language Processing. In this blog, I implement the simplest kind of language model: a statistical language model. Since it is built on bigrams, it is known as a bigram language model.

In a bigram language model we work with bigrams, which are pairs of words that occur next to each other in the corpus (the entire collection of words/sentences).

For example -

In the sentence "DEV is awesome and user friendly" the bigrams are :

"DEV is", "is awesome", "awesome and", "and user", "user friendly"

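As a rough sketch of this idea (the function name `sentence_bigrams` is only an illustrative choice and is not part of the program further down), the bigrams of a sentence can be produced by pairing each word with the word that follows it:

```python
def sentence_bigrams(sentence):
    """Return the adjacent word pairs of a whitespace-separated sentence."""
    words = sentence.split()
    # zip(words, words[1:]) pairs word i with word i + 1
    return list(zip(words, words[1:]))


print(sentence_bigrams("DEV is awesome and user friendly"))
# [('DEV', 'is'), ('is', 'awesome'), ('awesome', 'and'), ('and', 'user'), ('user', 'friendly')]
```

A sentence of n words always yields n - 1 bigrams, so a single word yields none.
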
In this code the readData() function holds the four sentences which form the corpus. The sentences are

This is a dog
This is a cat
I love my cat
This is my name

and these sentences are split into individual words, which are returned as one flat list of tokens.

Then there is a function createBigram() which finds all the possible bigrams and builds dictionaries of bigrams and unigrams along with their frequency, i.e. how many times they occur in the corpus.
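For comparison, here is a minimal sketch of the same counting step using `collections.Counter` from the standard library. It is not the createBigram() function of this post: the name `count_ngrams` is hypothetical, and it assumes the corpus is kept as a list of sentences so that no bigram can span two sentences.

```python
from collections import Counter


def count_ngrams(sentences):
    """Count unigrams and within-sentence bigrams for a list of sentences."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        unigram_counts.update(words)
        # Pair each word with its successor inside the same sentence only
        bigram_counts.update(zip(words, words[1:]))
    return unigram_counts, bigram_counts


corpus = ["This is a dog", "This is a cat", "I love my cat", "This is my name"]
unigram_counts, bigram_counts = count_ngrams(corpus)
print(bigram_counts[("This", "is")])  # 3
print(unigram_counts["cat"])          # 2
```

Because it works sentence by sentence, this version does not need to rely on capital letters to detect where a new sentence starts.
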

Then the function calcBigramProb() is used to calculate the probability of each bigram. In terms of probabilities, the formula is

P(w2 | w1) = P(w1, w2) / P(w1)

In practice we estimate these probabilities from counts in the corpus, which is basically

P(w2 | w1) = count(w1, w2) / count(w1)
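To make the count-based estimate concrete: the probability of a bigram is its count divided by the count of its first word. A small sketch, with the counts copied by hand from the four sentences above and a hypothetical helper name `bigram_probability`:

```python
def bigram_probability(bigram, bigram_counts, unigram_counts):
    """Estimate P(w2 | w1) as count(w1, w2) / count(w1)."""
    first_word = bigram[0]
    return bigram_counts[bigram] / unigram_counts[first_word]


# Counts taken by hand from the four-sentence corpus
bigram_counts = {("is", "a"): 2, ("is", "my"): 1}
unigram_counts = {"is": 3}

print(bigram_probability(("is", "a"), bigram_counts, unigram_counts))   # 2 / 3
print(bigram_probability(("is", "my"), bigram_counts, unigram_counts))  # 1 / 3
```

The two probabilities sum to 1 because "a" and "my" are the only words that ever follow "is" in this corpus.
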

We then use these probabilities to find the probability of the next word, or, by applying the chain rule, the probability of a whole sentence, which is what this program does. The program given below finds the probability of the sentence "This is my cat".
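Written out for this sentence, and assuming the bigram probabilities estimated from the four sentences above, the bigram approximation of the chain rule multiplies one conditional probability per adjacent pair of words:

```latex
\begin{align*}
P(\text{This is my cat})
  &\approx P(\text{is} \mid \text{This}) \cdot P(\text{my} \mid \text{is}) \cdot P(\text{cat} \mid \text{my}) \\
  &= \frac{3}{3} \cdot \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6} \approx 0.1667
\end{align*}
```

As in the program, the probability of the first word on its own is left out of the product.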


def readData():
    # The corpus: four short sentences
    data = ['This is a dog', 'This is a cat', 'I love my cat', 'This is my name']
    dat = []
    # Flatten the sentences into a single list of words
    for sentence in data:
        for word in sentence.split():
            dat.append(word)
    print(dat)
    return dat

def createBigram(data):
    listOfBigrams = []
    bigramCounts = {}
    unigramCounts = {}
    # Stop one word early: the last word has no successor. Because of this,
    # the final word of the corpus ('name') is never counted as a unigram;
    # it never starts a bigram, so no probability depends on its count.
    for i in range(len(data) - 1):
        # A capitalized next word marks the start of a new sentence in this
        # corpus, so pairs that cross a sentence boundary are skipped.
        if data[i + 1].islower():
            bigram = (data[i], data[i + 1])
            listOfBigrams.append(bigram)
            bigramCounts[bigram] = bigramCounts.get(bigram, 0) + 1

        unigramCounts[data[i]] = unigramCounts.get(data[i], 0) + 1
    return listOfBigrams, unigramCounts, bigramCounts


def calcBigramProb(listOfBigrams, unigramCounts, bigramCounts):
    listOfProb = {}
    for bigram in listOfBigrams:
        word1 = bigram[0]
        # P(word2 | word1) = count(word1, word2) / count(word1)
        listOfProb[bigram] = bigramCounts[bigram] / unigramCounts[word1]
    return listOfProb


if __name__ == '__main__':
    data = readData()
    listOfBigrams, unigramCounts, bigramCounts = createBigram(data)

    print("\n All the possible Bigrams are ")
    print(listOfBigrams)

    print("\n Bigrams along with their frequency ")
    print(bigramCounts)

    print("\n Unigrams along with their frequency ")
    print(unigramCounts)

    bigramProb = calcBigramProb(listOfBigrams, unigramCounts, bigramCounts)

    print("\n Bigrams along with their probability ")
    print(bigramProb)
    inputList = "This is my cat"
    splt = inputList.split()
    outputProb1 = 1
    bilist = []

    # Collect the bigrams of the input sentence
    for i in range(len(splt) - 1):
        bilist.append((splt[i], splt[i + 1]))

    print("\n The bigrams in given sentence are ")
    print(bilist)
    # Multiply the bigram probabilities; an unseen bigram makes the product 0
    for bigram in bilist:
        outputProb1 *= bigramProb.get(bigram, 0)
    print('\nProbability of sentence "' + inputList + '" = ' + str(outputProb1))

Output

['This', 'is', 'a', 'dog', 'This', 'is', 'a', 'cat', 'I', 'love', 'my', 'cat', 'This', 'is', 'my', 'name']

All the possible Bigrams are
[('This', 'is'), ('is', 'a'), ('a', 'dog'), ('This', 'is'), ('is', 'a'), ('a', 'cat'), ('I', 'love'), ('love', 'my'), ('my', 'cat'), ('This', 'is'), ('is', 'my'), ('my', 'name')]

Bigrams along with their frequency
{('This', 'is'): 3, ('is', 'a'): 2, ('a', 'dog'): 1, ('a', 'cat'): 1, ('I', 'love'): 1, ('love', 'my'): 1, ('my', 'cat'): 1, ('is', 'my'): 1, ('my', 'name'): 1}

Unigrams along with their frequency
{'This': 3, 'is': 3, 'a': 2, 'dog': 1, 'cat': 2, 'I': 1, 'love': 1, 'my': 2}

Bigrams along with their probability
{('This', 'is'): 1.0, ('is', 'a'): 0.6666666666666666, ('a', 'dog'): 0.5, ('a', 'cat'): 0.5, ('I', 'love'): 1.0, ('love', 'my'): 1.0, ('my', 'cat'): 0.5, ('is', 'my'): 0.3333333333333333, ('my', 'name'): 0.5}

The bigrams in given sentence are
[('This', 'is'), ('is', 'my'), ('my', 'cat')]

Probability of sentence "This is my cat" = 0.16666666666666666
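
The scoring loop of the program can also be wrapped in a small function, which makes it easy to see what happens when a sentence contains a pair of words that never occurred together in the corpus. This is only a sketch: the name `sentence_probability` is hypothetical, and the dictionary holds just a few of the probabilities from the output, copied by hand.

```python
def sentence_probability(sentence, bigram_prob):
    """Multiply the probabilities of all adjacent word pairs in the sentence."""
    words = sentence.split()
    probability = 1.0
    for bigram in zip(words, words[1:]):
        # An unseen bigram contributes probability 0 (no smoothing)
        probability *= bigram_prob.get(bigram, 0.0)
    return probability


# A few of the bigram probabilities from the output, copied by hand
bigram_prob = {
    ("This", "is"): 1.0,
    ("is", "my"): 1 / 3,
    ("my", "cat"): 0.5,
    ("my", "name"): 0.5,
}

print(sentence_probability("This is my cat", bigram_prob))  # about 0.1667
print(sentence_probability("This is my dog", bigram_prob))  # 0.0, ('my', 'dog') never occurs
```

Every word of "This is my dog" appears in the corpus, yet the sentence still gets probability 0 because one of its bigrams was never seen.
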

The problem with this type of language model is that if we increase the n in n-grams it becomes computationally intensive, and if we decrease n then long-term dependencies are not taken into consideration. Also, if a sentence contains an unknown word, or a pair of words that never occurred together in the corpus, its probability becomes 0. This zero-probability problem can be addressed with a method known as smoothing, in which we assign some probability to unseen words as well. Two very famous smoothing methods are