DEV Community

Cover image for Building and using Markov chains - AI for text generation - Part II

Posted on

Building and using Markov chains - AI for text generation - Part II

In the previous post we saw some theory behind Markov chains. In this one we're going to define the datastructures and the process involved in the creation and usage of one.

Cover phto by Maria Orlova on Pexels

Important: Although some code chunks in this post might look like JS of Python, they are all just pseudo-code.

Data structures

Markov chains are directed graphs; the usual way to represent these is using an adjacency matrix. Each row will represent a word A and each column a word B; each cell will store the probability of B being the next word after A.

So if we had the following text: "How much wood would a woodchuck chuck if a woodchuck could chuck wood", the matrix would like like this:

\ a chuck could how if much wood woodchuck would
a 0 0 0 0 0 0 0 1 0
chuck 0 0 0 0 0.5 0 0.5 0 0
could 0 1 0 0 0 0 0 0 0
how 0 0 0 0 0 1 0 0 0
if 1 0 0 0 0 0 0 0 0
much 0 0 0 0 0 0 1 0 0
wood 0 0 0 0 0 0 0 0 1
woodchuck 0 0.5 0.5 0 0 0 0 0 0
would 1 0 0 0 0 0 0 0 0

Do you notice something special about this matrix? Most of its entries are zero! So instead of implementing a matrix as, for example, an array of arrays and waste a lot of space in zeroes, we should use a sparse matrix implementation with a dictionary or tuples and lists. Something like this:

(a, [(woodchuck, 1)]),
(chuck, [(if, 0.5), (wood, 0.5)]),
(could, [(chuck, 1)]),
(how, [(much, 1)]),
(if, [(a, 1)]),
(much, [(wood, 1)]),
(wood, [(would, 1)]),
(woodchuck, [(chuck, 0.5), (could, 0.5)])
(would, [(a, 1)])
Enter fullscreen mode Exit fullscreen mode

and finally, we could compress it even more if we tokenize the text, replacing words by code numbers. In this case, the tokens will be the positions in the sorted list of unique words(a is 1, chuck is 2, etc), like this:

[(8, 1)],
[(5, 0.5), (7, 0.5)],
[(2, 1)],
[(6, 1)],
[(1, 1)],
[(7, 1)],
[(9, 1)],
[(2, 0.5), (3, 0.5)],
[(1, 1)]
Enter fullscreen mode Exit fullscreen mode

Much better than the first matrix, right? Now let's see how we get the chain from the text

Building the chain

Preprocess the text

The first step consists on cleaning the text so we can get the more out of it. Some relevant steps here would be converting the whole text to lower case, expanding contractions and performing tokenization (replace words with their number of first appearance, for example). Steps like removing punctuation may be optional in this case, and other usual steps like lemmatization or removing stop words make no sense here.

An example of this would be:

  • Before preprocessing:
  I don't think Albert will come. I think he said he wouldn't.
Enter fullscreen mode Exit fullscreen mode
  • After expansion and conversion to lower case:
  i do not think albert will come. i think he said he would not.
Enter fullscreen mode Exit fullscreen mode
  • After tokenization, using the following regex to define tokens: ([^\W_]+)|([,.;:?!]+) (Tokens will be words, numbers and punctuation signs).
  token_dictionary = {
    'i': 1,
    'do': 2,
    'not': 3,
    'think': 4,
    'albert': 5,
    'will': 6,
    'come': 7,
    '.': 8,
    'he': 9,
    'said': 10,
    'would': 11
  tokenized_text = [-1, 1, 2, 3, 4, 5, 6, 7, 8, 1, 4, 9, 10, 9, 11, 3, -2]
Enter fullscreen mode Exit fullscreen mode

Wait, what are those -1 and -2 doing at the beginning and the end of the list? Well, they are special tokens reserved to mark just that: when a sentence starts and when it ends. They will be important when generating sentences with the chain.

Make a list of words and its followers

Now that we have a list with the tokenized text, we have to iterate it storing the number of appearances of each word and the frequency that other words appear after it:

    appearances = {}
    matrix = {}
    it = 1
    while it < tokenized_text.length 
        current = tokenized_text[it - 1]
        next = tokenized_text[it]

        // EXISTING WORD
        if current in appearances.keys
            // EXISTING PAIR
            if next in matrix[current].keys
            // NEW PAIR
                matrix[current] = {next: 1}

        // NEW WORD
            appearances[current] = 1
            matrix[current] = {next: 1}
        it = it + 1
    for word in matrix.keys
        for follower in matrix[word].keys
            matrix[word][follower] = matrix[word][follower] / appearances[word]
    return matrix
Enter fullscreen mode Exit fullscreen mode

And your Markov chain would be ready to use!

Generating text from the chain

Now, to generate the chain we will need a roulette selection function to pick an element from a word-probability dictionary.

roulette (words):
    selection = random_number() // floating between 0 and 1
    for candidate in words.keys
        probability = words[candidate]
        selection = selection - probability
        if selection < 0
            return candidate
Enter fullscreen mode Exit fullscreen mode

And now, using that function we can use the Markov chain to generate a new word token given the previous:

generate_next_token (matrix, token):
    return roulette(matrix[token])
Enter fullscreen mode Exit fullscreen mode

Remember the special tokens -1 and -2? This is where they come handy. Using them we are able to make the function that generates a full sentence:

generate_sentence_tokens (matrix):
    tokens = [-1]
    while tokens.last != -2
        tokens.push(generate_next_token(matrix, token.last))
    return tokens
Enter fullscreen mode Exit fullscreen mode

Finally, we just have to translate back tokens to words, ignoring the first and the last ones. You might also want to do some postprocessing to put uppercase where is due and fix the whitespaces, for example before punctuation signs. I leave that to you.


The pseudo-code I've shown here works for a first order Markov chain. In the next posts, I will explain how to do a chain of order N and implement it in several languages. I hope you found this useful, please leave your thoughts in the comments!

Recommended reading

You can follow me on Twitter! 🐦

Top comments (0)