
Dian Fay

Originally published at di.nmfay.com

Surrealist Remixes with Markov Chains

There's a new button at the bottom of this (and each) post. Try clicking it! (If you're reading this on dev.to or an RSS reader, you'll need to visit di.nmfay.com to see it)

By now everyone's run into Twitter bots and automated text generators that combine words in ways that almost compute. There's even a subreddit that runs the user-generated content of other subreddits through individual accounts which make posts that seem vaguely representative of their sources, but either defy comprehension or break through into a sublime silliness.

People have engaged in wordplay (and word-work) for as long as we've communicated with words. Taking language apart and putting it back together in novel ways has been the domain of poets, philosophers, and magicians alike for eons, to say nothing of puns, dad jokes, glossolalia, and word salad.

In the early 20th century, artists associated with the Surrealist movement played a game, variously for entertainment and inspiration, called "exquisite corpse". Each player writes a word (in this version, everyone is assigned a part of speech ahead of time) or draws on an exposed section of paper, then folds the sheet over to obscure their work from the next player. Once everyone's had a turn, the full sentence or picture is revealed. The game takes its name from its first recorded result: le cadavre exquis boira le vin nouveau, or "the exquisite corpse shall drink the new wine".

The Surrealist seeds fell on fertile ground and their ideas spread throughout the artistic and literary world, just as they themselves had been informed by earlier avant-garde movements like Symbolism and Dada. In the mid-century, writers and occultists like Brion Gysin and William Burroughs used similar techniques to discover new meanings in old texts. The only real difference in our modern toys is that they run on their own -- it's a little bit horror movie ouija board, except you can see the workings for yourself.

There are a variety of ways to implement this kind of functionality. On the more primitive side, you have "mad libs" algorithms which select random values to insert into known placeholders, as many Twitter bots such as @godtributes or @bottest_takes do. This method runs up against obvious limitations fairly quickly: the set of substitutions is finite, and the structure they're substituted into likewise becomes predictable.

More advanced text generators are predictive, reorganizing words or phrases from a body of text or corpus in ways which reflect the composition of the corpus itself: words aren't simply jumbled up at random, but follow each other in identifiable sequences. Many generators like these run on Markov chains, probabilistic state machines where the next state is a function only of the current state.

Implementing a Textual Markov Chain

The first order of business in using a Markov chain to generate text is to break up the original corpus. A regular expression matching whitespace makes that easy enough, turning the corpus into an array of words. The next step is to establish the links between states, which is where things start getting a little complex.
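As a minimal sketch, assuming the corpus has already been read into a string named corpus (a name chosen here for illustration, not taken from remix.js):

// Split on runs of whitespace; filter out the empty strings that
// leading or trailing whitespace would otherwise leave behind.
const words = corpus.split(/\s+/).filter(word => word.length > 0);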

Textual Markov chains have one important parameter: the prefix length, which defines how many previous states (words) comprise the current state and must be evaluated to find potential next states. Prefixes must comprise at least one word, but for the purposes of natural-seeming text generation the sweet spot tends to be between two and four words depending on corpus length. With too short a prefix length, the output tends to be simply garbled; too long a prefix or too short a corpus, and there may be too few potential next states for the chain to diverge from the original text.

Mapping prefixes to next states requires a sliding window on the array. This is more easily illustrated. Here's a passage from Les Chants de Maldoror, a 19th-century prose poem rediscovered and given new fame (or infamy) by the Surrealists, who identified in its obscene grandeur a deconstruction of language and the still-developing format of the modern novel that prefigured their own artistic ideology:

He is as fair as the retractility of the claws of birds of prey; or again, as the uncertainty of the muscular movements in wounds in the soft parts of the lower cervical region; or rather, as that perpetual rat-trap always reset by the trapped animal, which by itself can catch rodents indefinitely and work even when hidden under straw; and above all, as the chance meeting on a dissecting-table of a sewing-machine and an umbrella!

Assuming a prefix length of 2, the mapping might start to take this shape:

"He is": ["as"],
"is as": ["fair"],
"as fair": ["as"],
"fair as": ["the"]

Starting from the first prefix ("He is"), there is only one next state possible since the words "He is" only appear once in the corpus. Upon reaching the next state, the active prefix is now "is as", which likewise has only one possible next state, and so forth. But when the current state reaches "as the", the next word to be added may be "retractility", "uncertainty", or "chance", and what happens after that depends on the route taken. Multiple next states introduce the potential for divergence; this is also why having too long a prefix length, or too short a corpus, results in uninteresting output!
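Building that mapping is a matter of sliding a window of prefix-length words along the array and recording whatever word follows it. Here's a rough sketch in JavaScript, with names chosen for illustration rather than taken from remix.js:

// Map each stringified prefix to the array of words that can follow it.
function buildChain(words, prefixLength = 2) {
  const chain = new Map();

  for (let i = 0; i + prefixLength < words.length; i++) {
    // The prefix is the window of prefixLength consecutive words...
    const prefix = JSON.stringify(words.slice(i, i + prefixLength));
    // ...and the next state is whatever word comes right after it.
    const next = words[i + prefixLength];

    if (!chain.has(prefix)) chain.set(prefix, []);
    chain.get(prefix).push(next);
  }

  return chain;
}

Run against the Maldoror passage with a prefix length of 2, a prefix like "as the" ends up with several possible next words ("retractility", "uncertainty", "chance"), which is exactly where the chain becomes interesting.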

Because the prefix is constantly losing its earliest word and appending the next, it's stored as a stringified array rather than as a concatenated string. The order of operations, sketched in code after the list, goes like this:

  1. Select one of the potential next states for the current stringified prefix array.
  2. shift the earliest word out of the prefix array and push the selected next word onto the end.
  3. Stringify the new prefix array.
  4. Repeat until bored, or until there's no possible next state.
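Here's a sketch of that loop with the choice in step 1 made at random rather than interactively, assuming the chain map built above (again, the names are illustrative):

// Walk the chain from a starting prefix until there's no next state
// or the output reaches a maximum length.
function walk(chain, startPrefix, maxWords = 50) {
  const prefix = [...startPrefix];
  const output = [...startPrefix];

  while (output.length < maxWords) {
    // Steps 1 & 3: stringify the current prefix array and look up its next states.
    const candidates = chain.get(JSON.stringify(prefix));
    if (!candidates) break;

    // Step 1: select one of the potential next states.
    const next = candidates[Math.floor(Math.random() * candidates.length)];

    // Step 2: shift the earliest word out and push the selected word on.
    prefix.shift();
    prefix.push(next);
    output.push(next);
  }

  return output.join(' ');
}

Calling walk(buildChain(words), words.slice(0, 2)) begins the remix from the opening of the corpus rather than from a random prefix.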

Remix!

If you're interested in the actual code, it's remix.js in devtools, or you can find it in source control.

Markov chain generators aren't usually interactive; that's where the "probabilistic" part of "probabilistic state machine" comes into play. This makes the implementation here incomplete by design. Where only one possible next state exists, the state machine advances on its own, but where there are multiple, it allows the user to choose how to proceed. This, along with starting from the beginning instead of selecting a random opening prefix, gives it more of an exploratory direction than if it simply restructured the entire corpus at the push of a button. The jury's still out on whether any great insights lie waiting to be unearthed, as the more mystically-minded practitioners of aleatory editing hoped, but in the meantime, the results are at least good fun.

Top comments (3)

Donald Merand

I love this! What's great to me about this article (and so many of your other articles) is how you're taking an already-interesting idea like probabilistic sentence fragment generation using Markov chains and then recontextualizing it into the world of surrealist art. There's so much to be gained by reading this through!

I made a similar thing in Ruby, but focused on individual words instead of sentences: rubygems.org/gems/markov_words/. I was interested in generating passwords that sound like English so that they're easy to remember but hard(er) to guess with dictionary attacks. I discovered a similar thing about prefix length, although with single-word Markov generators what you get is baby talk on one end, and complete words with strange endings on the other.

Dian Fay

Thanks! That's an interesting use for Markov chains, although I think it's a war password managers have won overall due to the sheer volume of accounts people have to make everywhere. You can't get them in everywhere, though, so there'll always be a call for backup strategies.

I remember reading ages ago on The Daily WTF from someone who implemented a password generator with a similar thought about memorability, using romanized Japanese phonemes due to their consistent pronunciation -- only to discover well after deployment that the Anglophone userbase found the occasional juxtaposition of pairs like "shi" and "te" memorable for less than entirely desirable reasons. As far as I can recall that was purely random, although the Scunthorpe problem could come up with predictive output too.

Donald Merand

The thing I didn't mention is that I may have also implemented my own password manager for my org which uses my password generator API :) The service fees on password managers are outrageous for group usage! But mine is decidedly less well-integrated into every single app platform. Still, it works well for sharing passwords.

I also definitely have the problem of inadvertently generating words that are not exactly things you'd want people shouting in the computer lab. I hadn't heard about the Scunthorpe problem, but that makes a lot of sense. So what we've learned is that if you give an army of monkeys typewriters, they won't write Shakespeare very often, but they'll spout out something profane every few minutes :)