Tariq Ali

Posted on Dec 23, 2016

NaNoGenMo 2016 and My Predictions About Text Generation

NaNoGenMo 2016

NaNoGenMo stands for National Novel Generation Month. Can you write code that can generate a 50,000-word novel in one month? Since it started in 2013, it has slowly gained a reputation for innovation in tackling a very hard problem (generating text at scale) in a variety of different ways.

NaNoGenMo 2016 appeared to be less active than NaNoGenMo 2015 or NaNoGenMo 2014. It had fewer issues and fewer stars overall. Media attention, once widespread and viral, has dropped off significantly - the only news article of note was Nerdcore's acknowledgement of the existence of the generative novel The Days Left Foreboding and Water. Coverage appears to be limited to sympathetic sources (usually other programmers).

However, a sea change in attitudes appears to be occurring beneath the surface of low activity. Knowledge about generative text has been formalized in blog posts (such as the ones that I have written before on dev.to, but also a Russian-language blog post) for easy reference. More newbies seemed eager and passionate to participate. Computer science instructors are giving their students extra credit if they submit an entry. One class made it a required assignment to submit a generative book. The computer-generated prose is now much more readable and interesting, though more work is still necessary.

The field of text generation is consolidating, and possibly improving. Which leads to a question...what happens next in the field? Looking at what happened in NaNoGenMo 2016, I am willing to venture 3 predictions...

Statistical Approaches Will Be Part Of A 'Toolkit' In Text Generation

During NaNoGenMo 2016, Vincent Toups posted a Tweetstorm examining two major approaches used in text generation - the "statistical" approach and the "structural modeling" approach. I've focused a lot on "structural modeling"...but less so on "statistical" approaches (e.g., machine learning).

[The "statistical" approach] is an attempt to create an ad-hoc generative model w/ a data set. The generative approach is broad: recurrent neural networks or word vectors or markov models are tools used in this approach.

Vincent Toups points out two issues associated with the "statistical" approach:

The statistics of a given novel are very bad. Most novels contain a series of distinct sentences. Even for a loose definition of "distinct". As such, if you are trying to learn a model for sentences, you don't have enough data.

... The second major issue for naive models is structure: novels are not sequences of sentences. They encode complex, large, correlations. Most notably, plot. But style, characterization, etc all live at scales which naive models have trouble characterizing.

Statistical approaches are good at producing text that looks human-ish. They are very effective at producing evocative words, phrases, and sentences. However, the generated text quickly reads as gibberish, even after brief exposure, so it cannot "scale" upwards to novel-length works.
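To make the "statistical" approach concrete, here is a minimal sketch of a word-level Markov model in Python. The sample text and the function names (build_chain, generate) are placeholders invented for this illustration, not code from any NaNoGenMo entry. Each word is chosen by looking only at the word or two before it, which is why the output can look locally plausible while having no plot.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, length, rng):
    """Walk the chain, jumping to a random state whenever we hit a dead end."""
    state = rng.choice(sorted(chain))
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end (e.g. the last words of the corpus): restart elsewhere.
            state = rng.choice(sorted(chain))
            output.extend(state)
            continue
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output[:length])


# Made-up sample text; a real run would train on a whole book or more.
sample = (
    "the ship sailed at dawn and the crew sang at dawn "
    "and the captain sailed the ship into the storm"
)
chain = build_chain(sample, order=1)
print(generate(chain, 20, random.Random(0)))
```

Raising order makes each phrase more coherent, but with a corpus as small as a single novel the model mostly ends up copying its source, which is the data problem Vincent describes.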

The previous blog posts I have written about "text generation" deal with what Vincent Toups calls "structural modeling". The idea with modeling is that the programmer writes code that 'encodes' the plot/style/characterization at a large scale. Even these models have their weaknesses and faults, but they do appear to produce far more readable text than "statistical" approaches. As Vincent points out -

[They] creep towards non-generated works, since the author is very present in them.

While Vincent points out the limits of both the "statistical" approach and the "structural modeling" approach, he also expresses hope for a fusion between the two:

I think a genuinely successful attempt will involve hybrid approaches: models for story and arc, statistical techniques for style. Even then, I suspect the results will lack a human feel. Literature, after all, is a complex communicative act between humans.

In 2016, a few programmers began moving towards this hybrid approach, combining "statistical approaches" and "structural modeling" to generate novels:

Isaac Karth used Textacy, a Python module intended for high-level 'natural language processing', to search within a Gutenberg corpus for key sentences that can be combined together into templates. These templates would then be used to generate scenes for his Pirate Novel. Isaac, however, admitted to using 'grep' to find key sentences as well.
Superphreak uses sentence similarity to generate 'workplace gripes'...a sentence is randomly chosen from a corpus, along with a few similar sentences. This produces a coherent mini-rant, as the sentences in the mini-rant all relate to each other.
Fire Up The Pulp Mill! generates short stories set in fantasy universes. The algorithm uses Markov chains to generate the names of various locations that characters can explore, and uses templates and combat simulations to explain what the characters actually do at those locations.
Annales is a novel about the history of a mighty country. It uses a neural network to generate "not just names but creatures, adjectives, abstract nouns, weapons, food and drink, and so on". The generated words were filtered to match certain criteria (for example - the names of all women must end with the letter 'a'). According to Mike Lynch, using neural networks to generate words "made the output more mysterious and got rid of a lot of the jokey anachronism of early drafts".
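The "pick a sentence, then pull in similar sentences" idea can be sketched in a few lines of Python. I do not know which similarity measure Superphreak actually uses; the sketch below assumes plain word-overlap (Jaccard) similarity, and the gripes, function names, and parameters are all invented for illustration.

```python
import random
import re


def tokens(sentence):
    """Lower-cased set of the words in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def jaccard(a, b):
    """Shared words divided by total distinct words (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def mini_rant(corpus, rng, size=3):
    """Pick a random seed sentence, then add its most similar neighbours."""
    seed = rng.choice(corpus)
    others = [s for s in corpus if s != seed]
    others.sort(key=lambda s: jaccard(seed, s), reverse=True)
    return [seed] + others[:size - 1]


# Hypothetical hand-written corpus of workplace gripes.
GRIPES = [
    "The printer is jammed again.",
    "Nobody ever refills the printer paper.",
    "The printer on the third floor is always jammed.",
    "The coffee machine is broken again.",
    "Nobody ever cleans the coffee machine.",
    "The meeting ran an hour over.",
]

print(" ".join(mini_rant(GRIPES, random.Random(42))))
```

Word overlap is a crude stand-in; word vectors or another learned measure could be swapped in for jaccard without changing the rest of the sketch.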

This "hybrid approach" is likely to be fruitful in the coming years. By taking advantage of both modeling and statistical techniques, the programmer will get the 'best of both worlds'.
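Here is a toy version of the hybrid idea, offered as a sketch rather than a reconstruction of any entry. A letter-level Markov chain (swapped in here for the neural network that Annales used) invents place names, a filter in the spirit of the "must end with 'a'" rule throws out the ones that do not fit, and hand-written templates supply the structure. The seed names, templates, and function names are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical seed list; a curated corpus of names would be far larger.
SEED_NAMES = ["Avalon", "Camelot", "Carthage", "Atlantis", "Valhalla",
              "Arcadia", "Shangri", "Numantia", "Lyonesse", "Cibola"]

# Hand-written templates carry the structure of the story.
TEMPLATES = [
    "The party left {origin} at dawn and reached {destination} by nightfall.",
    "Nobody in {origin} believed the rumours about {destination}.",
]

ORDER = 2  # how many previous letters the model looks at


def build_letter_chain(names):
    """Record which letter follows each pair of letters in the seed names."""
    chain = defaultdict(list)
    for name in names:
        padded = "^" * ORDER + name.lower() + "$"  # ^ = start, $ = end
        for i in range(len(padded) - ORDER):
            chain[padded[i:i + ORDER]].append(padded[i + ORDER])
    return chain


def make_name(chain, rng, max_len=12):
    """Statistical step: sample one name letter by letter."""
    state, letters = "^" * ORDER, []
    while len(letters) < max_len:
        nxt = rng.choice(chain[state])
        if nxt == "$":
            break
        letters.append(nxt)
        state = state[1:] + nxt
    return "".join(letters).capitalize()


def filtered_names(chain, rng, count, max_tries=1000):
    """Filter step: keep only names of 4+ letters that end in 'a'."""
    found = set()
    for _ in range(max_tries):
        if len(found) >= count:
            break
        name = make_name(chain, rng)
        if len(name) >= 4 and name.endswith("a"):
            found.add(name)
    return sorted(found)


def write_line(names, rng):
    """Structural step: drop two distinct names into a template."""
    origin, destination = rng.sample(names, 2)
    return rng.choice(TEMPLATES).format(origin=origin, destination=destination)


rng = random.Random(2016)
names = filtered_names(build_letter_chain(SEED_NAMES), rng, count=4)
for _ in range(3):
    print(write_line(names, rng))
```

The division of labour is the point: the chain only has to be good at small, evocative things (names), while the templates keep the sentences readable.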

Corpus Curation Will Be A Necessary Skill In 'Text Generation'

Can a human write a story without any knowledge of his world whatsoever? With no concepts, ideas, even language? I doubt that it is possible. Humans require societal input to be able to generate meaningful output. Machines require input as well, in the form of a corpus.

It is not a coincidence that the current hype in AI coincides with the rise of Big Data. Machines are able to process and manipulate data much more effectively than a human can, thereby making them suitable for generating useful statistical predictions and analyses about the world (even as one concedes the machines don't really "understand" the world).

The text-generation equivalent of data is the 'corpus' - written text that is fed into a computer. The computer can stitch together text from a corpus to generate a story, just like a human uses letters from a language to generate a story. A corpus is required for text generation, even if the corpus is as basic as templates. Even machine learning algorithms such as RNNs require some kind of pre-existing corpus to train on. A program requires a corpus to write effectively (just like humans require pre-existing knowledge to write effectively).

The upshot is that the programmer must also engage in "corpus curation". The programmer must search for text that can then be used by the algorithm to generate stories. The process of finding a corpus can be difficult and time-consuming, so many NaNoGenMo programmers rely on Gutenberg and corpora for their text. If a specific corpus is not easily available, then the programmer must handwrite the corpus, a rather painful process.

Generally, as the corpus grows larger, the generated output becomes more varied and interesting. This means that the best way to improve a text generator is to simply gather more data. Of course, the process of gathering data is itself an expensive endeavour, forcing programmers to be more creative in how they use the corpus they already have.

Examples From 2016 NaNoGenMo:

Superphreak is a novel where the main character dives into dumpsters to find valuable newspaper clippings and technical manuals about his world. Whenever a dumpster dive is successful, new pieces of the corpus get shown to the user.
The Track Method places passages from Gutenberg right next to each other to generate a human-readable 13,606-word novella about a secret agent living in an underground complex.
If on a winter's night a library cardholder is a story about a person trying to find a specific book from the Gutenberg corpus. Whenever the main character opens up a book, you see a random passage straight from that book, before the character decides whether to keep the book or return it to the shelves.
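A small part of corpus curation is mechanical: splitting raw text into usable passages and discarding the junk. The Python sketch below shows that step plus the "open a book, read a random passage" move. The two "books" are invented stand-ins rather than real Gutenberg text, and the function names and the five-word cutoff are arbitrary assumptions.

```python
import random

# Hypothetical stand-in for a curated corpus (title -> text). In practice
# these would be files collected from a source such as Project Gutenberg.
LIBRARY = {
    "A Sailor's Diary": (
        "We left the harbour before the sun had risen.\n\n"
        "Day two.\n\n"
        "The storm took our mainsail and most of our courage."
    ),
    "Notes on Gardening": (
        "Roses should be pruned in late winter, before new growth.\n\n"
        "Water deeply, but never more than twice a week."
    ),
}


def passages(text, min_words=5):
    """Split on blank lines and drop fragments too short to stand alone."""
    chunks = (chunk.strip() for chunk in text.split("\n\n"))
    return [chunk for chunk in chunks if len(chunk.split()) >= min_words]


def open_random_book(library, rng):
    """Pick a book, then a passage from it, the way a browsing reader might."""
    title = rng.choice(sorted(library))
    return title, rng.choice(passages(library[title]))


rng = random.Random(7)
for _ in range(3):
    title, passage = open_random_book(LIBRARY, rng)
    print(f"You open '{title}': {passage}")
```

Most of the real effort goes into what this sketch takes for granted: deciding which books belong in LIBRARY in the first place.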

The Goal Of The Programmer Will Be To 'Scale' Novel Experiences, Not To Save Money

The traditional narrative about automation is that machines are able to do the same job that a human can, but at a much cheaper rate. As a result, the human is likely to lose employment since the automation is more productive.

While I subscribe to this traditional narrative for most industries, it is a bit harder to apply it to generative text. Writers' wages are already very low to begin with (and nonexistent if you include volunteer bloggers and social media users). Any efficiency gains from switching from cheap human labor to cheap automated labor must be limited...and that is if you ignore the initial setup costs. Cost savings may exist in some form, but they will not be why people use computers to write.

Instead, computers have the advantage of 'scaling' up interesting ideas. A human may balk at the idea of writing personalized sonnets for a thousand people at a moment's notice, generating dialogue trees for hundreds of different characters in a cyberpunk video game, or writing twelve horrible James Bond puns. But a computer doesn't care. Code is just a way of formalizing processes, and the computer enables a programmer to turn a creative idea into a physical, final product.

There's a limit to how far computers can 'scale' up those interesting ideas though. It is very easy to produce infinite variation, but humans are good at pattern matching and will easily discover the patterns in your generation. Once humans discover the patterns, then they will likely get bored. Human literature has patterns too, but the patterns of a machine are likely to be more simplistic and easier to identify and pin down.

The clever programmer can work around this limitation by developing a more interesting pattern, creating multiple patterns, or increasing the corpus from which the machine draws...but that requires just as much work as hand-crafting the content. This is why Orteil, the developer of Cookie Clicker, tweeted:

thanks to procedural generation, I can produce twice the content in double the time

Still, computers can tell stories that humans have never even bothered to think about telling before. By providing new and interesting experiences, computer-generated literature might be able to secure a worthwhile niche. Possibly this may be all it can hope for, as machine-generated literature becomes its own unique genre, with its own dedicated fanbase (while most humans are happy with human-generated works). But it is also possible that the appeal of novelty may enable machine-generated literature to reach mainstream status.

Examples From 2016 NaNoGenMo:

The Days Left Foreboding and Water engages in 'blackout poetry', where certain words are highlighted in text while all other words are blacked out.
Pursuit - A Paranoid Thriller is a novel about an increasingly frantic chase scene, with political coverups, shady meetings, and insecure chatrooms.
Dear Santa uses Tweets to write "[a] contemporary epistolary novel" about people's desires.
A Primer In Gont Poetics is an academic textbook about humans attempting to translate and understand alien poetry.
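Blackout poetry is one of the easier of these ideas to prototype. The Python sketch below assumes the words of the poem have already been chosen (how The Days Left Foreboding and Water picks its words is not covered here) and simply blacks out everything else in the passage. The function name, the sample sentence, and the use of the full-block character are illustrative choices.

```python
import re

BLOCK = "\u2588"  # full-block character used to black out a letter


def blackout(passage, poem_words):
    """Keep `poem_words` (in order) visible and black out every other word."""
    remaining = [word.lower() for word in poem_words]
    pieces = []
    # Alternate between runs of letters and runs of everything else.
    for token in re.findall(r"[A-Za-z']+|[^A-Za-z']+", passage):
        if not re.match(r"[A-Za-z']", token):
            pieces.append(token)  # spacing and punctuation stay as-is
        elif remaining and token.lower() == remaining[0]:
            pieces.append(token)  # next word of the poem: leave visible
            remaining.pop(0)
        else:
            pieces.append(BLOCK * len(token))
    if remaining:
        raise ValueError(f"words not found in order: {remaining}")
    return "".join(pieces)


print(blackout("The sea is calm tonight, and the tide is full.",
               ["sea", "is", "full"]))
```

Scaling this up to a novel is mostly a matter of looping over passages; the interesting work is in choosing which words survive.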

Appendix

This blog post is computer-generated, using a handwritten corpus. Source code is here. I use the Ruby gem Prolefeed to generate this blog post.

Previous Articles In Text Generation

Top comments (2)

Ben Halpern • Dec 23 '16

Your "this blog post is computer-generated" appendix gets me every time. 😄

I feel the publishing industry has created a model where they try to extract maximum value out of their human bloggers and treat them like assembly-line workers. There's not a lot of value in "aggregating" stories. It's just publishers trying to eke out a slice of the pie. A lot of this work will certainly be automated in the future.

I feel like the quantity over quality model that Internet publishers have fallen into is flawed for the business and the consumer because there is so much replicated work and it does not take advantage of economies of scale very well. In the future, there may be fewer overall writing jobs, but they will leverage human creativity much more than the current landscape.

Before the content-farming blog jobs totally go away, they will trend in the direction of editorial oversight into a more automated process. This part will probably vanish eventually as well.

Tariq Ali • Dec 24 '16

I agree completely that the current ecosystem is perverse and unsustainable. To add a bit more content to this comment though, you would probably be very interested in the Content Shock hypothesis, which argues that the users' demand for more content will ultimately halt (meaning that incumbents who are 'first to market' with content end up dominating, while new contenders who are trying to push out content get ignored). In November 2015, the author claimed that content shock has arrived. Surprisingly, the author believes that one way for marketers to handle "Content Shock" is to mass-produce content, thereby drowning out the competitors, and believes that this mass-production will likely be done by computers.

If this content ecosystem does collapse, then the amount of text generation (human or automated) will probably decrease. It's an open question whether the humans who remain in that line of work will decide to embrace automation (and use the content wreckage as corpus) or to shun it.