Automated text generator using Markov Chain

#machinelearning #automatedtext #textgenerator #markovchains

See this step by step guide on how the algorithm works with reference code provided.

Making computer generated text mimic human speech is fascinating and actually not that difficult for an effect that is sometimes convincing, but certainly entertaining. Markov Chain’s is one way to do this. It works by generating new text based on historical texts where the original sequencing of neighbouring words (or groups of words) is used to generate meaningful sentences.

What’s really interesting, is that you can take historical texts of a person, then generate new sentences which can sound similar to the way that person speaks. Alternatively, you can combine texts from two different people and get a mixed “voice”.

I played around this with texts of speeches from two great presidents:

What my Markov Chain generated which was “trained” using the combination of texts from Obama speeches and Bartlet scripts, is as follows:

‘Can I burn my mother in North Carolina for giving us a great night planned.’
‘And so going forward, I believe that we can build a bomb into their church.’
‘’Charlie, my father had grown up in the Situation Room every time I came in.’’
‘This campaign must be ballistic.’,

What is a Markov Chain in the context of a text generation?

For a more technical explanation, I think you can find plenty of resources out there. In simple terms, it is an algorithm which is used to generate a new outcome from a weighted list of words based on historical texts. Now that’s rather abstract. In more practical terms, in the scenario for text generation, it is a way to use historical texts, chop it up into individual words (or sets of words), and then randomly chose a given word then randomly chose the next likely words based on historical sequences. For example:

An example that shows the original text (A), the dictionary that gets generated of the words (B), and a sample text (C ) that was generated from randomly selecting words with selected words highlighted in red. The numbers in brackets indicates the occurrences of that word in the original word.

This doesn’t just apply in text as well (although one of the most popular applications is in your smart phone where there’s predictive text), it can be used for any scenario where you use historical information to define next steps for a given state. For example, you could codify a given stock market pattern (such as the % daily changes for the last 30 days), then use that to see historically what was the likely next day outcome (example only.. I’m very doubtful how effective it would be).

Why are they so fun?

I’ve always wanted to build a text generator as it’s just an awesome way to see how you could mimic intelligence using a very cheap shortcut. You’ll see the algorithm below, and it is super simple. The other fact is that, like above example, you can use it to mix the ‘voice’ from two different persons and see the outcome.

How does it work?

There are two phases for text generation with Markov Chains. There’s first the ‘dictionary build phase’ which involves gathering the historical texts, and then generating a dictionary with the key being a given word in a sentence, and then having the resultant being the natural follow-up words.

Here you can see the original sentences were broken down into words and the included the subsequent words with a counter to indicate number of occurrences. Note that full-stops are also included.

The second is the execution, where you start from a given word, then use that word to see what the next word would be in a probabilistic way. For example:

Traversing the dictionary to generate text

Now, there are some tricks which you need to be mindful of ( I found this out the hard way):

You can’t start from any random word — if you do, then you’ll get sentences like this: “ ate the cat.” . You have to keep track of “starting words” to keep things simple — hence you can have: “John ate the cat” .
Don’t ignore punctuation— if you do remove punctuation, you’ll get sentence like this: “ The dog barked at John cat”. Instead keep them there so that you can have a better chance to have a more realistic sentence — i.e. “The dog barked at John’s ca ”
End on a full-stop word. When you go through and start from a word, then find the next word, then find the next word and so on, you can continue until you reach a specified length, but then you’ll end up stopping in mid-sentence such as this: “ The cat ate John’s”. Instead, simply end when you have a word that has a full stop (another reason not to remove the punctuation) — i.e. “The cat ate John’s boots. ”

Source texts

I played around with different texts including: Eddie Murphy stand-up routines, Donald Trump tweets, Obama speeches, and Jed Bartlet dialogue. You can find them all here. It’s great to use one source and then generate the dictionary, but then you can mix and match and use two sources (e.g. Obama and Bartlet) and then create the one dictionary file. Then when you traverse the dictionary you get the both voices.

It is important to make sure that you can balance the text — e.g. if you had a 8000 text from Obama and only 1000 text from Eddie Murphy, it’s likely that you would see more of the Obama words. Of course, when you build the dictionary, you can also add some artificial weighting towards the lighter text source to balance things out.

Conclusion

It’s not perfect — you’ll see when you create your own, that some text is just gibberish. The more text that you have the better. Secondly, using single words is not helpful in the dictionary — you should use groups of 2–3 words. The actual number depends on how much historical text you have.

You can find all the code and source texts here. Good luck!

Thanks for reading! If you like what you read, give a clap below so that others may find this (you can also find me on Twitter_ )_