<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tariq Ali</title>
    <description>The latest articles on DEV Community by Tariq Ali (@tra).</description>
    <link>https://dev.to/tra</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F146%2F11179090.jpeg</url>
      <title>DEV Community: Tariq Ali</title>
      <link>https://dev.to/tra</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tra"/>
    <language>en</language>
    <item>
      <title>Statistical Approaches Are Actually Pretty Good - A "Mea Culpa" Regarding One Of My Predictions About Text Generation</title>
      <dc:creator>Tariq Ali</dc:creator>
      <pubDate>Wed, 17 Jun 2020 21:25:31 +0000</pubDate>
      <link>https://dev.to/tra/statistical-approaches-are-actually-pretty-good-a-mea-culpa-regarding-one-of-my-predictions-about-text-generation-1daa</link>
      <guid>https://dev.to/tra/statistical-approaches-are-actually-pretty-good-a-mea-culpa-regarding-one-of-my-predictions-about-text-generation-1daa</guid>
      <description>&lt;p&gt;For some time, I have written &lt;a href="https://dev.to/tra"&gt;blog posts about text generation&lt;/a&gt;, and even went so far as to collect them into &lt;a href="https://tra38.gitbooks.io/procedurally-generated-narratives/content/"&gt;an online ebook&lt;/a&gt;. In one of the chapters of said e-book, written in 2016, I made predictions about &lt;a href="https://dev.to/tra/nanogenmo-2016-and-my-predictions-about-text-generation"&gt;the future of text generation&lt;/a&gt;. One of those predictions was &lt;strong&gt;"Statistical Approaches Will Be Part Of A 'Toolkit' In Text Generation".&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Statistical approaches are good for presenting text that looks human-ish. It is very effective at producing evocative words, phrases, and sentences. However, the generated text quickly turns into gibberish with even short-term exposure. The generated text could not "scale" upwards to novel-length works.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That didn't mean statistical approaches (e.g., machine learning) were worthless. Indeed, the fact that they can produce evocative text is, well, pretty cool. But you would need to combine them with another approach ('structural modeling', e.g. hardcoded text generation logic) - essentially using machine learning as part of a 'toolkit' rather than as a standalone tool.&lt;/p&gt;

&lt;p&gt;That was true back in 2016. It's still true now. My problem was assuming that it would &lt;em&gt;always&lt;/em&gt; be true...and only now do I realize my error.&lt;/p&gt;

&lt;p&gt;It's like someone predicting in the 1850s that people will &lt;em&gt;never&lt;/em&gt; be able to develop airplanes. The prediction would be correct in the 1860s. And even in the 1870s. But one day, airplanes did get invented, and the prediction would be proven wrong.&lt;/p&gt;

&lt;p&gt;When OpenAI announced on &lt;a href="https://openai.com/blog/better-language-models/"&gt;February 14th, 2019&lt;/a&gt; that it had created a neural network that can generate text (the infamous GPT-2), I was flabbergasted at how coherent it was - even if the examples were indeed hand-selected. Sure, there were still flaws with said generated text. But it was still better than what I originally expected. In &lt;a href="https://github.com/tra38/Paranoia_Super_Mission_Generator/issues/13"&gt;a GitHub issue&lt;/a&gt; for my own text generation project, I wrote at the time:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Even though there are still subtle flaws in the neural network they have right now, those subtle flaws can be fixed given enough time and resources. OpenAI have demonstrated what is possible, and what is possible will wind up being &lt;em&gt;inevitable&lt;/em&gt;. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of course, I didn't post a "Mea Culpa" back then. Just because something is inevitable doesn't mean that it will happen immediately enough for it to be practical.&lt;/p&gt;

&lt;p&gt;One month after OpenAI's announcement of GPT-2, Rich Sutton wrote a blog post entitled &lt;a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html"&gt;The Bitter Lesson&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of course, Rich intended for this paragraph to advocate in favor of "scaling computation by search and learning" (the "statistical approach")...but I could also read it in the opposite direction - that if you want to have results &lt;em&gt;now&lt;/em&gt;, not at some unspecified time in the future, you &lt;em&gt;want&lt;/em&gt; to build knowledge into your agent ("structural modeling" or a hybrid approach). Yes, your technology will be declared obsolete when the "statistical approach" prevails, but you don't know when that will happen, and your code is going to be obsolete anyway due to advances in technology. So embrace the Bitter Lesson. If you want to write a computer program to write text &lt;em&gt;now&lt;/em&gt;...well, then do it now! Don't wait for innovation to happen.&lt;/p&gt;

&lt;p&gt;Then, on May 28th 2020, &lt;a href="https://arxiv.org/abs/2005.14165"&gt;OpenAI announced that it built GPT-3&lt;/a&gt;, which was a massive improvement over GPT-2, fixing several of its faults. The generated text does still turn into gibberish, eventually...but now it takes medium-term exposure to reveal its gibberish nature, which is remarkable to me. Interestingly, it's also very good at "few-shot learning": provide the generator with a few examples, and it can quickly figure out what type of text you want it to generate and then generate it. Here's a cherry-picked example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Human: A "whatpu" is a small, furry animal native to Tanzania.  An example of a sentence that uses&lt;br&gt;
the word whatpu is:&lt;/p&gt;

&lt;p&gt;Human: We were traveling in Africa and we saw these very cute whatpus.&lt;/p&gt;

&lt;p&gt;Human: To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:&lt;/p&gt;

&lt;p&gt;GPT-3: One day when I was playing tag with my little sister, she got really excited and she started doing these crazy farduddles.&lt;/p&gt;

&lt;p&gt;Human: A "yalubalu" is a type of vegetable that looks like a big pumpkin. An example of a sentence that uses the word yalubalu is:&lt;/p&gt;

&lt;p&gt;GPT-3: I was on a trip to Africa and I tried this yalubalu vegetable that was grown in a garden there. It was delicious.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Huh. So I guess people &lt;em&gt;should&lt;/em&gt; have waited. Technology is indeed progressing faster than I expected, and in future text generation projects, I can definitely see myself tinkering with GPT-3 and its future successors rather than resorting to structural modeling. It's way more convenient to take advantage of a text generator's "few-shot learning" capabilities, configuring it to generate the text the way &lt;em&gt;I&lt;/em&gt; want it to - rather than writing code.&lt;/p&gt;
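
&lt;p&gt;Mechanically, "few-shot" prompting is just string assembly: you prepend a few worked examples to the new task and let the model continue the pattern. Here is a minimal Ruby sketch of that idea (the prompt layout is my own illustration; GPT-3 imposes no fixed prompt syntax):&lt;/p&gt;

```ruby
# Build a few-shot prompt by prepending worked examples to a new task.
# The layout (task line, answer line, blank line between shots) is an
# illustrative choice, not a format required by GPT-3.
def few_shot_prompt(examples, new_task)
  shots = examples.map { |task, answer| "#{task}\n#{answer}" }
  (shots + [new_task]).join("\n\n")
end

examples = [
  ['A "whatpu" is a small, furry animal native to Tanzania. ' \
   'An example of a sentence that uses the word whatpu is:',
   'We were traveling in Africa and we saw these very cute whatpus.']
]
new_task = 'A "yalubalu" is a type of vegetable that looks like a big pumpkin. ' \
           'An example of a sentence that uses the word yalubalu is:'

puts few_shot_prompt(examples, new_task)
```

&lt;p&gt;The model's continuation of that final line is the generated sentence; the examples do the "configuration" that would otherwise require writing code.&lt;/p&gt;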

&lt;p&gt;OpenAI thinks the same way, and has recently announced plans to &lt;a href="https://openai.com/blog/openai-api/"&gt;commercialize API access to its text generation models, including GPT-3&lt;/a&gt;. The 'profit motive' tends to fund further innovation along that line, which means more breakthroughs may occur at a steady clip.&lt;/p&gt;

&lt;p&gt;I do not believe that GPT-3, in its present state, can indeed "scale" upwards to novel-length works on its own. But I am confident that in the next 10-20 years[1], a future successor to GPT-3 &lt;em&gt;will&lt;/em&gt; scale, all by itself. I believe that statistical learning has a promising future ahead of it.&lt;/p&gt;

&lt;p&gt;[1] I'm a pessimist, so I tend to lean towards predicting 20 years - after all, AI Winters and tech bubbles are still a thing. But, I am also a pessimist about my pessimism, and can easily see "10 years" as a doable target.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A Review of "The Discovery Engine" - A Short Story About the Future of Content Generation </title>
      <dc:creator>Tariq Ali</dc:creator>
      <pubDate>Tue, 30 May 2017 18:48:16 +0000</pubDate>
      <link>https://dev.to/tra/a-review-of-the-discovery-engine---a-short-story-about-the-future-of-content-generation</link>
      <guid>https://dev.to/tra/a-review-of-the-discovery-engine---a-short-story-about-the-future-of-content-generation</guid>
      <description>&lt;p&gt;In December 2016, Ben Halpren and I got in &lt;a href="https://dev.to/tra/nanogenmo-2016-and-my-predictions-about-text-generation/comments/3i"&gt;a discussion over the future of publishing&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Ben Halpern&lt;/strong&gt; - I feel the publishing industry has created a model where they try to extract maximum value out of their human bloggers and treat them like assembly-line workers. There's not a lot of value in "aggregating" stories. It's just publishers trying to eke out a slice of the pie. A lot of this work will certainly be automated in the future.&lt;/p&gt;

&lt;p&gt;I feel like the quantity over quality model that Internet publishers have fallen into is flawed for the business and the consumer because there is so much replicated work and it does not take advantage of economies of scale very well. In the future, there may be fewer overall writing jobs, but they will leverage human creativity much more than the current landscape.&lt;/p&gt;

&lt;p&gt;Before the content-farming blog jobs totally go away, they will trend in the direction of editorial oversight into a more automated process. This part will probably vanish eventually as well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tariq Ali&lt;/strong&gt; - I agree completely that the current ecosystem is perverse and unsustainable. To add a bit more content to this comment though, you would probably be very interested in the &lt;a href="https://www.businessesgrow.com/2014/01/06/content-shock/"&gt;Content Shock&lt;/a&gt; hypothesis, which argues that the users' demand for more content will ultimately halt (meaning that incumbents who are 'first to market' with content end up dominating, while new contenders who are trying to push out content get ignored). In November 2015, the author claimed that &lt;a href="https://www.businessesgrow.com/2015/11/11/content-shock-is-here/"&gt;content shock has arrived&lt;/a&gt;. Surprisingly, the author believes that one way for marketers to handle "Content Shock" is to &lt;a href="https://www.businessesgrow.com/2016/09/19/crappy-content/"&gt;mass-produce content&lt;/a&gt;, thereby drowning out the competitors, and believes that this mass-production will likely be done by computers.&lt;/p&gt;

&lt;p&gt;If this content ecosystem does collapse, then the amount of text generation (human or automated) will probably decrease. It's an open question whether the humans that remain in that line decide to embrace automation (and use the content wreckage as corpus) or to shun it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The current "quantity over quality model" that exists within content generation exist primarly &lt;a href="http://relevance.com/how-google-ruined-content-marketing/"&gt;due to Google's algorithms favoring more content&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;While one high-quality article might drive a thousand shares, 10 articles that drive 120 shares each is more. Replace shares with traffic or conversions. It’s the same concept. In this way, Google is actually encouraging us to commoditize our content in lieu of creating great content, whether it’s purposeful or not.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or, to put it &lt;a href="https://digiday.com/media/whos-winning-at-volume-in-publishing/"&gt;more plainly&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Winning in digital media now boils down to a simple equation: figure out a way to produce the most content at as low a cost as possible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This does not sound sustainable, to me or to Ben. The only reason consumers haven't yet given up on the Internet, though, is our reliance on various curators (human aggregation, search engine bots, machine learning recommendation systems, etc.) to tell us what content to consume. But what about content &lt;strong&gt;producers&lt;/strong&gt;? What happens to them if they are unable to produce enough content that appeals to the curators? It seems that they would have to shut down. And when they do, the content ecosystem would implode.&lt;/p&gt;

&lt;p&gt;There is, of course, an alternative viewpoint that contrasts with both Ben's and mine, one that argues that the content ecosystem is &lt;em&gt;sustainable&lt;/em&gt;, that Content Shock is no problem at all, and that, far from collapsing, the ecosystem is bound to explode and proliferate.&lt;/p&gt;

&lt;p&gt;This viewpoint argues that major content publishers will learn how to take advantage of "economies of scale" by reusing existing works. Content publishers will also learn how to 'personalize' content to match the whims and desires of the people who are reading it, thereby keeping them engaged. The content ecosystem will adapt, and the process of adaptation would lead to a new 'golden age' of human creativity.&lt;/p&gt;

&lt;p&gt;That viewpoint is best expressed by the short story "The Discovery Engine".&lt;/p&gt;




&lt;p&gt;"The Discovery Engine, (or, Algorithms Everywhere and Nowhere)" is one of two short stories written by Professor Christopher Ketly in the peer-reviewed article &lt;a href="http://www.uclalawreview.org/two-fables/"&gt;Two Fables&lt;/a&gt;. This short story imagines a world where the "myth of the Romantic Author" died. As a result, the US government reformed intellectual property laws, allowing for major corporations to remix and reuse content in brand new ways.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[A]fter all the big media companies had figured out how to Spotify everything, there was not much reason to hemorrhage money into controlling media, when the game was to extract value in other ways: discovering, sorting, searching, repackaging, and repurposing. Copyright enforcement fell away; with everything open, or so multiply licensed it was impossible to track, it became clear that real, new growth was in repackaging and reselling information rather than investing in the creation of new information.  New information had margins way too low to invest in anymore, even if novelty still drove the market more than ever.  Something about the old argument that supporters of copyright terms used to make—that copyright was an incentive for people to create—no longer seemed to even make sense.  Innovation certainly had not died, nor had the music.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Innovations included, but were not limited to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repackaging public domain and 'orphan works' for modern-day audiences, through clever marketing techniques (Example: "Who in their right mind would sit through an entire Cassavetes movie when it was possible to watch it unfold as the crazy mashed-up backstory of an X-Men film?")&lt;/li&gt;
&lt;li&gt;Using English-to-English machine-translation algorithms to change the readability of texts to match a user's desires&lt;/li&gt;
&lt;li&gt;Personalizing manuscripts to tailor to an individual's personal biases and beliefs (for example, renaming characters and locations)...but not &lt;em&gt;too much&lt;/em&gt;, because you still want people to discuss with each other about the same reading experience(s)&lt;/li&gt;
&lt;li&gt;Programmatically adding 'dark patterns' to books as people read them (such as cliffhangers, targeted marketing emails, and "literary versions of clickbait")...with the express purpose of encouraging people to keep reading&lt;/li&gt;
&lt;li&gt;Reusing existing assets to generate new versions of the same story to be resold at different times (similar to the concept of reusable components in programming)&lt;/li&gt;
&lt;li&gt;Increasing discoverability of existing literature so that people can wind up finding assets that can be reused&lt;/li&gt;
&lt;li&gt;...or skipping the discovery process and using RNNs to just generate the stories outright&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Human creativity has been leveraged, and probably even expanded, in this dystopian future. Instead of tediously trying to come up with new words to express what has already been expressed better before, you can simply reuse what already exists and then tweak it to match your specifications. Creativity simply moves to a higher level of abstraction, similar to how programmers moved from low-level languages (like Assembly) to higher-level languages (like COBOL). 'DRYness' has been applied to literature.&lt;/p&gt;

&lt;p&gt;And there's no sign of technological unemployment. Instead, the content economy is itching to hire more people...&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Cool hunters" find the works to be repackaged and repurposed&lt;/li&gt;
&lt;li&gt;"Lit geeks" identify current fads and trends in literature, enabling corporations to tailor their generated books to a mass audience&lt;/li&gt;
&lt;li&gt;Privacy violators spy on people and gather data about their reading habits&lt;/li&gt;
&lt;li&gt;Data scientists analyze the data and design the algorithms necessary to generate the stories&lt;/li&gt;
&lt;li&gt;And, of course, marketers develop the initial ad campaigns to acquire potential readers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, writers are nowhere to be found in this golden age of human creativity except in an aside describing the creation of a new TV series ("The Game of Thrones reboot"):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Writers were employed by the system to check these texts, but not to write them, just to tweak and improve them, insert clever jokes and Easter eggs, to look for the giveaways that the algorithms could not spot.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I find "The Discovery Engine" to be (a) a pleasant read and (b) a plausible prediction for the fate of all content generation (both human-made and machine-made). I do quibble with two of the author's points though.&lt;/p&gt;

&lt;p&gt;First, while RNNs certainly have their place as &lt;a href="https://dev.to/tra/nanogenmo-2016-and-my-predictions-about-text-generation"&gt;part of a "toolkit" in text generation&lt;/a&gt;, I highly doubt that they have the potential to generate coherent stories by themselves. I could imagine a world where RNNs generate hundreds of different stories and then a secondary algorithm filters through the resulting junk to find one or two stories that may be worthwhile. Even in this imaginary state though, I'd still see a fleet of data scientists behind the scenes, trying their best to massage the data and the algorithms to produce the best possible outcome.&lt;/p&gt;

&lt;p&gt;The second point I quibble with is more philosophical. One subplot in this sci-fi story involved society's reaction to multiple versions of the same public domain work being published by major corporations, leading to a reactionary (and doomed) impulse by a few scholars to recover and preserve the "original" works.&lt;/p&gt;

&lt;p&gt;But in the real world, I don't think anybody would care. After all, the story of &lt;a href="https://en.wikipedia.org/wiki/Cinderella"&gt;Cinderella&lt;/a&gt; has been rewritten countless times, with different versions personalized to different regions of the world. And we can sensibly discuss the different versions of Cinderella (for example: arguing whether The Brothers Grimm's version is superior to Walt Disney's version) without getting ourselves confused. Ideas have always been cheap; it's the execution of those ideas that matters.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The New Work departments in the publishers and movie studios were eventually dwarfed by the Repurposing departments; it became increasingly impossible to sell a book or script just because no one had ever written something like it before.  It turned out, actually, that this was always false, and it was far easier and cheaper to update Trollope than to pay Franzen for his novel on the same topic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Backlink&lt;/strong&gt;:&lt;br&gt;
This article was &lt;a href="http://tra38.github.io/blog/the-discovery-engine.html"&gt;originally published on May 28th 2017&lt;/a&gt; on &lt;a href="http://tra38.github.io/blog/index.html"&gt;my personal blog&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How To Abuse the /docs "Feature" To Quickly Deploy a Middleman Static Site Onto GitHub Pages</title>
      <dc:creator>Tariq Ali</dc:creator>
      <pubDate>Sat, 29 Apr 2017 22:20:11 +0000</pubDate>
      <link>https://dev.to/tra/how-to-abuse-the-docs-feature-to-quickly-deploy-a-middleman-static-site-onto-github-pages</link>
      <guid>https://dev.to/tra/how-to-abuse-the-docs-feature-to-quickly-deploy-a-middleman-static-site-onto-github-pages</guid>
      <description>&lt;ul&gt;
&lt;li&gt;"Middleman is a static site generator using all the shortcuts and tools in modern web development."---&lt;a href="//middlemanapp.com"&gt;Middleman Home Page&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of those shortcuts is having a directory structure to help organize your project. Here's a short version of that directory structure (a more complete version is located in &lt;a href="https://middlemanapp.com/basics/directory-structure/" rel="noopener noreferrer"&gt;the docs&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mymiddlemansite/
-- .gitignore
-- Gemfile
-- Gemfile.lock
-- config.rb
-- source (all the files necessary for building the static site)
-- build (the generated static site files) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The idea is that you would write all your code in the source directory, type in &lt;code&gt;bundle exec middleman build&lt;/code&gt; to generate your build directory, and then ship this build directory elsewhere to be served by a web server. Most likely, you would use a tool such as &lt;a href="https://github.com/middleman-contrib/middleman-deploy" rel="noopener noreferrer"&gt;middleman-deploy&lt;/a&gt; or &lt;a href="https://github.com/edgecase/middleman-gh-pages" rel="noopener noreferrer"&gt;middleman-gh-pages&lt;/a&gt; to automate the deployment process. Both of these tools are able to natively support deploying to GitHub Pages, usually by pushing up the build directory to the &lt;code&gt;gh-pages&lt;/code&gt; branch of a GitHub repository.&lt;/p&gt;

&lt;p&gt;But what if you don't want to use those tools to deploy to a separate branch of your repository? What if you wanted your website files to be in the &lt;strong&gt;same&lt;/strong&gt; branch as your Middleman source code, for maintainability and aesthetic reasons? GitHub Pages luckily gives you that option...though in a rather indirect and hacky fashion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhelp.github.com%2Fassets%2Fimages%2Fhelp%2Fpages%2Fselect-gh-pages-or-master-as-source.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhelp.github.com%2Fassets%2Fimages%2Fhelp%2Fpages%2Fselect-gh-pages-or-master-as-source.png" title="Logo Title Text 1" alt="Three Options: gh-pages branch, master branch, or /docs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the settings of your GitHub repository, you tell GitHub Pages to read from either the &lt;code&gt;gh-pages&lt;/code&gt; branch, the &lt;code&gt;master&lt;/code&gt; branch, or the &lt;code&gt;/docs&lt;/code&gt; folder.&lt;/p&gt;

&lt;p&gt;The last option is incredibly useful for open-source projects that want a dedicated &lt;code&gt;/docs&lt;/code&gt; folder to store all their documentation files. For example, the &lt;a href="https://github.com/jekyll/classifier-reborn" rel="noopener noreferrer"&gt;Classifier-Reborn open source project&lt;/a&gt; has a &lt;a href="http://www.classifier-reborn.com" rel="noopener noreferrer"&gt;GitHub Page&lt;/a&gt;, and its source code is stored within the &lt;a href="https://github.com/jekyll/classifier-reborn/tree/master/docs" rel="noopener noreferrer"&gt;&lt;code&gt;/docs&lt;/code&gt; folder&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So we're going to abuse this feature to deploy our own Middleman site. First, we need to select the &lt;code&gt;master branch /docs folder&lt;/code&gt; option in the settings of our repo. Second, we need to rename &lt;code&gt;/build&lt;/code&gt; to &lt;code&gt;/docs&lt;/code&gt;. This can be done by editing &lt;code&gt;config.rb&lt;/code&gt; to change the build directory's name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;configure&lt;/span&gt; &lt;span class="ss"&gt;:build&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;set&lt;/span&gt; &lt;span class="ss"&gt;:build_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'docs'&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we just type out...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bundle exec middleman build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;...commit the resulting build folder to our repository, and then push the result up to our GitHub repository.&lt;/p&gt;

&lt;p&gt;There is still one tiny flaw though. Any time you need to make a change, you must rebuild the &lt;code&gt;/docs&lt;/code&gt; directory and commit the generated files. I leave it to the reader to decide how to best automate this process (probably by writing a Rake task).&lt;/p&gt;
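
&lt;p&gt;For instance, a Rake task along these lines could rebuild and commit in one step (a sketch only; the task name and commit message are arbitrary choices of mine):&lt;/p&gt;

```ruby
# Rakefile - rebuild the site into /docs and commit the generated files.
# Sketch only: the task name and commit message are arbitrary choices.
desc "Rebuild the site into /docs and commit the result"
task :deploy do
  sh "bundle exec middleman build"
  sh "git add docs"
  sh "git commit -m 'Rebuild site'"
  sh "git push origin master"
end
```

&lt;p&gt;You would then run &lt;code&gt;bundle exec rake deploy&lt;/code&gt; whenever the site changes.&lt;/p&gt;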

</description>
    </item>
    <item>
      <title>Using Machine Learning To Generate Human-Readable News Articles</title>
      <dc:creator>Tariq Ali</dc:creator>
      <pubDate>Mon, 20 Mar 2017 06:49:54 +0000</pubDate>
      <link>https://dev.to/tra/using-machine-learning-to-generate-human-readable-news-articles</link>
      <guid>https://dev.to/tra/using-machine-learning-to-generate-human-readable-news-articles</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR Abstract&lt;/strong&gt; - I built &lt;a href="https://github.com/tra38/ZombieWriter" rel="noopener noreferrer"&gt;ZombieWriter&lt;/a&gt;, a Ruby gem that will enable users to generate news articles by aggregating paragraphs from other sources. It can use either machine learning algorithms (Latent Semantic Analysis and k-means clustering) or randomization to generate human-readable articles. In this article, I demonstrate how ZombieWrtier can use machine learning to create &lt;a href="https://gist.github.com/tra38/aa7e9c63708f6e21c32db5c3616162b5" rel="noopener noreferrer"&gt;a Markdown file containing 17 human-readable articles&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After finishing the demonstration and comparing the output to a randomization process, I then explain possible "future research" plans to improve the text generation process. I am not yet ready to claim that this technology is disruptive. But it's very close to being "production-ready".&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Machine learning is hot (to put it mildly). The paradigm of using data instead of code to program machines has been applied to solve a variety of real-world problems. It has also been applied to the creative arts, with neural networks being used to generate &lt;a href="https://github.com/jcjohnson/neural-style" rel="noopener noreferrer"&gt;paintings&lt;/a&gt; and &lt;a href="http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/" rel="noopener noreferrer"&gt;music&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Text generation using machine learning is far less advanced though. Many famous attempts have been focused on &lt;a href="http://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/" rel="noopener noreferrer"&gt;training neural networks to generate text&lt;/a&gt;, and the output tends to be evocative in short bursts. However, these texts do not "scale" properly. Any coherence that a neural network generates by accident gets drowned out by the incoherence that it usually generates.&lt;/p&gt;

&lt;p&gt;I have previously blogged about text generation. But I've been interested in &lt;strong&gt;"human-readable"&lt;/strong&gt; text generation - text generation that reaches the same quality standards as human literature. Machine learning algorithms were impressive, but they could not scale up well. So I usually ignored them in my blogging, in favor of "structural modeling" (which I &lt;a href="https://dev.to/tra/nanogenmo-2016-and-my-predictions-about-text-generation"&gt;defined&lt;/a&gt; as "writing code that 'encodes' the plot/style/characterization at a large scale").&lt;/p&gt;

&lt;p&gt;Yet, technology marches onward. As machine learning becomes more popular and accessible, people are able to try different approaches and techniques to the problem of text generation. Some of these techniques were actually successful, when applied in conjunction with structural modeling. For example, Mike Lynch used neural networks to &lt;a href="http://mikelynch.org/2016/Dec/10/annales-1-vocab/" rel="noopener noreferrer"&gt;generate words for fantasy settings&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This march of technology should not be that surprising. &lt;a href="https://en.wikipedia.org/wiki/Infinite_monkey_theorem" rel="noopener noreferrer"&gt;An infinite number of monkeys with an infinite number of typewriters can write Shakespeare&lt;/a&gt;...similarly, an infinite number of programmers conducting an infinite number of experiments can make great discoveries.&lt;/p&gt;

&lt;p&gt;A few months ago, I admitted that machine learning may play a role in the future of text generation, and made a prediction - &lt;a href="https://dev.to/tra/nanogenmo-2016-and-my-predictions-about-text-generation"&gt;"Statistical Approaches Will Be Part Of A 'Toolkit' In Text Generation"&lt;/a&gt;. At the same time, I was putting the finishing touches on a secret project to use machine learning in my own text generation. (If you're curious in the nitty-gritty research details, here's &lt;a href="https://gist.github.com/tra38/a9a4fe09e4f2a7247843f613256365f0" rel="noopener noreferrer"&gt;some excerpts of a Slack conversation in September 2016&lt;/a&gt; about my experiments.)&lt;/p&gt;

&lt;p&gt;The final result of my project is &lt;a href="https://github.com/tra38/ZombieWriter" rel="noopener noreferrer"&gt;ZombieWriter&lt;/a&gt;. According to its GitHub profile, ZombieWriter is "a Ruby gem that will enable users to generate news articles by aggregating paragraphs from other sources". You have to provide the paragraphs, but once you do, ZombieWriter will arrange them in a manner that makes them appear to be news articles.&lt;/p&gt;

&lt;p&gt;While you can download the gem and try it out yourself, knowing how it works may be more interesting. Here's a step-by-step guide to how ZombieWriter's machine learning algorithm works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step One - Latent Semantic Analysis
&lt;/h2&gt;

&lt;p&gt;Latent Semantic Analysis is a machine learning algorithm that is used for determining how similar paragraphs are to each other. It was invented in the late 1980s, and has generally been used for "information retrieval".&lt;/p&gt;

&lt;p&gt;LSA works by creating "bags-of-words" containing all the important words in a paragraph, and then comparing the different "bags-of-words" to determine their 'similarity'. Rather than deal with the math (because I don't quite understand it myself), here's a visual example, with four documents. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;document&lt;/th&gt;
&lt;th&gt;bag-of-words&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;cat, meow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;cat, feline, pet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;pet, food&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;feline, lion&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Document A and Document B are very similar to each other, because they both share the same word ("cat"). Document B and Document C are similar to each other because they share the same word ("pet"). Document B and Document D are similar to each other because they share the same word ("feline"). Etc., etc.&lt;/p&gt;

&lt;p&gt;One interesting feature of LSA is that it is able to infer relationships between different words based on their co-occurrence in documents...which also helps it identify similarities.&lt;/p&gt;

&lt;p&gt;For example, Document B contains the words "cat" and "feline", which suggest that both words are related to each other. Since Document D contains the term "feline", Document D is obviously similar to Document B (which contains "feline"). Document D is also similar to Document A, since Document A contains the word "cat", and "cat" is related to "feline".&lt;/p&gt;
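&lt;p&gt;To make this concrete, here is a toy Ruby sketch of bag-of-words similarity for the four documents above. This only captures the word-overlap intuition; real LSA also applies singular value decomposition, which is omitted here.&lt;/p&gt;

```ruby
# Toy bag-of-words similarity: shared words divided by total distinct words.
# This is only the overlap intuition behind LSA, not LSA itself.
docs = {
  A: %w[cat meow],
  B: %w[cat feline pet],
  C: %w[pet food],
  D: %w[feline lion]
}

def similarity(bag1, bag2)
  shared = bag1.count { |word| bag2.include?(word) }
  total  = (bag1 + bag2).uniq.length
  shared.to_f / total
end

docs.keys.combination(2) do |first, second|
  puts "#{first} vs #{second}: #{similarity(docs[first], docs[second]).round(2)}"
end
```

&lt;p&gt;Documents A and B score 0.25 (they share "cat"), while A and C score 0.0 (no shared words).&lt;/p&gt;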

&lt;p&gt;For more information on the mathematics behind LSA, please review the blog post &lt;a href="http://blog.josephwilk.net/ruby/latent-semantic-analysis-in-ruby.html" rel="noopener noreferrer"&gt;Latent Semantic Analysis in Ruby&lt;/a&gt;. I used &lt;a href="https://github.com/jekyll/classifier-reborn" rel="noopener noreferrer"&gt;Classifier-Reborn&lt;/a&gt;, a Ruby gem, for conducting LSA of my paragraphs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step Two - K-Means Clustering
&lt;/h2&gt;

&lt;p&gt;Once you determine what paragraphs are similar to each other, you need to group similar paragraphs together, using an approach called "clustering". Clustering tends to be used for exploratory data analysis; you have a lot of data that you want to easily understand, so you have an algorithm separate the data into different groups.&lt;/p&gt;

&lt;p&gt;There are many different ways to cluster data, but the most popular approach is "k-means clustering" (&lt;em&gt;k&lt;/em&gt; stands for the number of clusters you want). &lt;em&gt;k&lt;/em&gt; cluster centers are created, and the algorithm moves the cluster centers around to minimize the distance between each cluster center and the elements within its cluster. &lt;a href="http://ce.sharif.edu/courses/90-91/2/ce725-1/resources/root/Readings/Data%20Clustering%20-%20%2050%20Years%20Beyond%20K-Means.pdf" rel="noopener noreferrer"&gt;Anil K. Jain&lt;/a&gt; wrote an excellent paper outlining the various different approaches to clustering, and provided excellent visualizations, including this example of k-means clustering with an arbitrary dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fi.imgur.com%2FL6rEOIV.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fi.imgur.com%2FL6rEOIV.png" alt="Visual explanation of K-Means Clustering"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I chose k-means clustering because it was a tried-and-tested approach to clustering, meaning that all its strengths and weaknesses are pretty well-documented. To implement k-means clustering, I used the &lt;a href="https://github.com/gbuesing/kmeans-clusterer" rel="noopener noreferrer"&gt;KMeansClusterer&lt;/a&gt; ruby gem.&lt;/p&gt;
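&lt;p&gt;The assign-and-update loop described above can be sketched in a few lines of Ruby. This is a simplified one-dimensional toy for illustration, not KMeansClusterer's actual implementation, which handles multi-dimensional data and smarter initialization.&lt;/p&gt;

```ruby
# Minimal 1-D k-means sketch: assign each point to its nearest center,
# then move each center to the mean of its assigned points, and repeat.
def kmeans(points, k, iterations: 10)
  centers = points.first(k)  # naive initialization
  clusters = {}
  iterations.times do
    # Assignment step: group points by their nearest center
    clusters = points.group_by { |point| centers.min_by { |center| (point - center).abs } }
    # Update step: recompute each center as the mean of its cluster
    centers = clusters.values.map { |members| members.sum.to_f / members.size }
  end
  clusters.values
end

kmeans([1, 2, 3, 10, 11, 12], 2)
# returns [[1, 2, 3], [10, 11, 12]] for this data
```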

&lt;h2&gt;
  
  
  Step Three - Headlines
&lt;/h2&gt;

&lt;p&gt;Finally, we need to generate a headline for each cluster of paragraphs, to make the clusters look less like collections of similar paragraphs and more like news articles. There were three approaches I could take:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find key words within the paragraphs and use them outright as the headline. (&lt;a href="https://github.com/domnikl/highscore" rel="noopener noreferrer"&gt;Highscore&lt;/a&gt; Ruby gem.)&lt;/li&gt;
&lt;li&gt;Look for key phrases within the article and just regurgitate those key phrases. (&lt;a href="https://github.com/reddavis/N-Gram" rel="noopener noreferrer"&gt;N-Gram&lt;/a&gt; Ruby gem.)&lt;/li&gt;
&lt;li&gt;"Summarize" the article by picking the "most important sentence" in the article. Use that summary as your headline (&lt;a href="https://github.com/McFreely/epitome" rel="noopener noreferrer"&gt;epitome&lt;/a&gt; Ruby gem.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last approach produced the most readable headlines with the least amount of human intervention, so I chose that approach. However, I then realized that Classifier-Reborn also had its own built-in summarizer. To avoid taking on unneeded dependencies, I decided to use Classifier-Reborn's summarizer instead of "epitome".&lt;/p&gt;

&lt;p&gt;Classifier-Reborn uses LSA for its summarization. It first breaks the "cluster" up into sentences and then chooses the sentence that has the highest similarity when compared to all the other sentences. So, let's look at my table again, this time treating each "document" as a sentence in a larger article.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;document&lt;/th&gt;
&lt;th&gt;bag-of-words&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;cat, meow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;cat, feline, pet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;pet, food&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;feline, lion&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Document B appears to have the most similarity, as it shares words with all the other documents in the set (A, C, and D). Therefore, it is likely that we would pick Document B and treat it as our headline.&lt;/p&gt;
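&lt;p&gt;This selection step can also be sketched in Ruby, again using raw word overlap as a stand-in for Classifier-Reborn's actual LSA-based scoring:&lt;/p&gt;

```ruby
# Pick the sentence whose bag-of-words overlaps the most with all the others.
sentences = {
  A: %w[cat meow],
  B: %w[cat feline pet],
  C: %w[pet food],
  D: %w[feline lion]
}

headline = sentences.max_by do |_, bag|
  sentences.values.sum { |other| bag.count { |word| other.include?(word) } }
end.first

puts headline
# prints B
```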

&lt;h2&gt;
  
  
  A Demonstration of ZombieWriter
&lt;/h2&gt;

&lt;p&gt;National Novel Generation Month (or NaNoGenMo) is a very interesting competition that has attracted a lot of commentary. Rather than write our own commentary, why not simply reuse and remix existing commentary?&lt;/p&gt;

&lt;p&gt;For this demonstration, I first prepared a CSV file &lt;a href="https://gist.github.com/tra38/805003ef51ff63093b3c2775f161ce3c" rel="noopener noreferrer"&gt;&lt;code&gt;nanogenmo.csv&lt;/code&gt;&lt;/a&gt;. This CSV file contains 90 paragraphs about the NaNoGenMo competition, gathered from all over the Internet. To ensure that we don't engage in plagiarism, we will always provide proper attribution to each source.&lt;/p&gt;

&lt;p&gt;Then, I wrote a small script to parse this CSV, feed it into ZombieWriter, and generate some news articles in an external file. ZombieWriter will also include proper citations, linking back to the original sources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;'zombie_writer'&lt;/span&gt;

&lt;span class="c1"&gt;#I like to use the smarter_csv gem as my CSV parser. It's a personal taste of mine.&lt;/span&gt;

&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;'smarter_csv'&lt;/span&gt;

&lt;span class="n"&gt;zombie&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ZombieWriter&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;MachineLearning&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;

&lt;span class="n"&gt;array_of_paragraphs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;SmarterCSV&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"nanogenmo.csv"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;array_of_paragraphs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;paragraph&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;zombie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paragraph&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zombie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_articles&lt;/span&gt;

&lt;span class="no"&gt;File&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"articles.md"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"w+"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;puts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;- - -&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the Markdown file containing &lt;a href="https://gist.github.com/tra38/aa7e9c63708f6e21c32db5c3616162b5" rel="noopener noreferrer"&gt;17 different articles&lt;/a&gt; about NaNoGenMo. And here's an example generated article:&lt;/p&gt;

&lt;blockquote&gt;
&lt;h2&gt;
  
  
  0 - Computers are coming for our jobs, as writers
&lt;/h2&gt;

&lt;p&gt;Creative and artistic feats are often seen as the last refuge for human endeavor from the coming robot apocalypse. But if NaNoGenMo gains a foothold and improves, at least we'll all be well entertained in our unemployment.---&lt;a href="http://www.smithsonianmag.com/smart-news/computers-write-novels-faster-you-do-180953491/?no-ist" rel="noopener noreferrer"&gt;"Computers Write Novels Faster Than You Do", Smithsonian Maganize&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Computers are coming for our jobs, as writers. Spooky.---&lt;a href="https://www.facebook.com/groups/336929776373849/permalink/764264346973721/?match=bmFub2dlbm1v" rel="noopener noreferrer"&gt;Tom Trimbath&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But maybe this contest just reflects our evolution towards a more technological society. For the last 16 years November has seen “National Novel Writing Month” (or NaNoWriMo), a free event challenging amateur writers to compose a 50,000-word novel before December 1st. But two years ago it was suddenly joined by this companion event for artistically-inclined computer programmers, dubbed NaNoGenMo – drawing some suitably geeky jokes on Twitter. It's an irresistible challenge for a certain kind of creatively-inclined geek “This sounds like a great idea…” read one of the responses to the contest announcement. “How can anyone not take part?”---&lt;a href="https://thenewstack.io/computers-get-busy-national-novel-generating-month/" rel="noopener noreferrer"&gt;Computers Get Busy for National Novel-Generating Month&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For programmers, there are many interesting things about NaNoGenMo even if no breakthroughs in AI are expected. (The point of an exercise like this isn't that it's done well, but that it's done at all. A month is not enough to build a robust system, but it is enough to experiment with and prototype new approaches for generating fifty thousand words of intelligible text. The value of a compressed time frame for experimentation is something that the participants of NaNoWriMo can well appreciate.)---&lt;a href="http://clarkesworldmagazine.com/another_word_01_16/" rel="noopener noreferrer"&gt;Another Word:Let's Write a Story Together, MacBook&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;While the generated articles' quality tends to be pretty good, there are a few problems with this demonstration that deserve notice.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;It takes 15 seconds to generate those 17 articles. This probably isn't too bad for someone well-versed in machine learning, but it is somewhat shocking to a newbie who has to wait a while for results to pop up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The articles' sizes can vary, from very short (1-2 paragraphs) to very long (11 paragraphs). I may have to break up some of the larger clusters so that the length of each article appears more uniform.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The headlines may need work. Some headlines can become pretty long (since they use very long sentences), meaning that I tend to skip the headline half-way through and just start reading the generated article. It's likely that if I were to use this to generate my own articles, I would just handwrite my own headlines.&lt;br&gt;&lt;br&gt;One headline, in particular, was really odd - "For instance, here is Mr". This oddity can be traced to Classifier-Reborn's approach to summarization - when it breaks an article up into individual sentences, it uses periods to indicate when a sentence has ended. "Mr.", obviously, had a period.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
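&lt;p&gt;The "Mr" oddity is easy to reproduce with a naive period-based splitter. This is a sketch of the failure mode, not Classifier-Reborn's exact splitting code:&lt;/p&gt;

```ruby
# Splitting on periods treats the period in "Mr." as a sentence boundary.
text = "For instance, here is Mr. Smith. He writes novels."
naive_sentences = text.split(/\.\s*/)
# returns ["For instance, here is Mr", "Smith", "He writes novels"]
```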

&lt;h2&gt;
  
  
  Randomization
&lt;/h2&gt;

&lt;p&gt;ZombieWriter can also generate news articles by using randomization. Instead of bothering with machine learning to create clusters, ZombieWriter can just randomly pick paragraphs to put in each cluster. Using randomization in ZombieWriter is really easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;
&lt;span class="n"&gt;zombie&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ZombieWriter&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Randomization&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
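&lt;p&gt;Under the hood, random "clustering" amounts to little more than a shuffle and a partition. Here is a sketch of what that could look like; this is my assumption for illustration, not ZombieWriter's actual source:&lt;/p&gt;

```ruby
# Shuffle the paragraphs, then deal them into k groups round-robin style.
paragraphs = (1..90).to_a  # stand-ins for 90 real paragraphs
k = 17

clusters = paragraphs.shuffle
                     .each_with_index
                     .group_by { |_, index| index % k }
                     .values
                     .map { |pairs| pairs.map { |paragraph, _| paragraph } }

puts clusters.length
# prints 17
```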



&lt;p&gt;I wanted to test whether machine learning would produce better articles than randomization, so I also used the same &lt;code&gt;nanogenmo.csv&lt;/code&gt; file with ZombieWriter::Randomization. Here's a Markdown file with &lt;a href="https://gist.github.com/tra38/a65408790642560498aa1d40a05be9fe" rel="noopener noreferrer"&gt;17 articles&lt;/a&gt;, using random generation.&lt;/p&gt;

&lt;p&gt;The articles generated using machine learning appear to be of slightly higher quality than the articles generated using randomization. This may be because the articles generated using machine learning are likely to appear more 'coherent', as the paragraphs within the article usually discuss the same themes.&lt;/p&gt;

&lt;p&gt;However, randomization is &lt;em&gt;faster&lt;/em&gt; than machine learning, able to generate 17 articles in less than a second. Randomization also relies on fewer lines of code, meaning that it's much easier for a programmer to maintain.&lt;/p&gt;

&lt;p&gt;That being said, the increased quality might outweigh the slower speed and the higher maintenance cost. So machine learning beats randomization, for now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Research
&lt;/h2&gt;

&lt;p&gt;There are certainly &lt;a href="https://github.com/tra38/ZombieWriter/issues" rel="noopener noreferrer"&gt;issues&lt;/a&gt; to address, such as fixing bugs, writing unit tests, and porting the library over to Python. And that type of maintenance work matters a lot. There's also the need to explore new ways to improve ZombieWriter's output, so that the generated text becomes "production-ready" -- able to be used in a wide variety of industries.&lt;/p&gt;

&lt;p&gt;Obviously, it's much easier to write about what should be done than to actually do it. Still, here are some ideas I'm interested in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Gather the Source Data More Efficiently&lt;/strong&gt; - ZombieWriter needs a CSV file containing the paragraphs it needs to remix as it sees fit. I was able to manually gather 90 paragraphs, but it was a tedious and time-consuming process. For ZombieWriter to successfully scale, we need to either pull paragraphs from an API (Reddit/Hacker News comments?) or write a new script to search online for news articles and extract paragraphs from these articles. Care must be taken to ensure that we provide proper attribution to whoever wrote the content.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For those who prefer to generate fiction instead of nonfiction, you could use a neural network to generate a bunch of nonsense text, and use that as the source data for ZombieWriter. ZombieWriter will then arrange the nonsense text into different clusters, and since the nonsense text in each cluster will share certain similarities, the resulting prose might appear more coherent.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generate Better Headlines&lt;/strong&gt; - Google experimented with headline generation using neural networks and &lt;a href="https://research.googleblog.com/2016/08/text-summarization-with-tensorflow.html" rel="noopener noreferrer"&gt;blogged about their experience on August 24th 2016&lt;/a&gt;. What makes these generated headlines interesting is that they are attempting to engage in "abstractive summarization" (rephrasing the article) instead of "extractive summarization" (using exact words from the article). Their headlines therefore read more fluently than my current approach.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rewrite The Generated Articles&lt;/strong&gt; - Right now, ZombieWriter is simply quoting from other people. But while human writers do rely on quotes from other people, they don't usually just regurgitate the quotes. They usually want to restate what other people say "in their own words". Could ZombieWriter behave in the same manner? This is actually a "machine translation" problem -- the goal is to translate an existing work into a new 'manner of speaking'. People interested in this problem may want to look at &lt;a href="https://github.com/tokestermw/tensorflow-shakespeare" rel="noopener noreferrer"&gt;Tensorflow-Shakespeare&lt;/a&gt;, which is a neural network that can translate modern English to Shakespearean English, and vice versa. A similar type of neural network may be useful here as well.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Explore Different Ways of Determining Similarities and Clustering&lt;/strong&gt; - For example, &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.3584&amp;amp;rep=rep1&amp;amp;type=pdf" rel="noopener noreferrer"&gt;probabilistic latent semantic analysis&lt;/a&gt; may produce much better results than normal Latent Semantic Analysis. Note, though, that this task is of a lower priority, since any quality improvement gained by improving Step 1 (similarity determination) and Step 2 (clustering) might be fairly limited and not worth the effort.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One interesting idea that I heard of recently is to "cluster" paragraphs based on &lt;strong&gt;dissimilarity&lt;/strong&gt; instead of similarity - to take radically different paragraphs and place them right next to each other. Presenting various opinions on a topic may be a good way to counteract the "filter bubble" and introduce new ideas to the readers.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Of the ideas I'm interested in, "Gather the Source Data More Efficiently" is the highest priority. The best way to improve a text generator is to expand its corpus, after all. After acquiring enough data though, it will be time to look at ways of utilizing that data more effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Appendix
&lt;/h2&gt;

&lt;p&gt;Unlike my previous blog posts on computer-generated literature, this blog post is human-generated. This may be due to its nature as a technical "how-to" tutorial, where the order of paragraphs matters a lot. While text generation is slowly advancing, it still has a long way to go.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ebook on Text Generation
&lt;/h2&gt;

&lt;p&gt;In the past, every time I wrote about text generation, I would provide links to all my previous articles on dev.to. But this section just grew larger and larger. Using the tagging system wouldn't really work since the first two articles were written before the introduction of the tagging system.&lt;/p&gt;

&lt;p&gt;Instead of providing more and more links, I created an eBook that contains all my articles...as well as links to the original sources. The eBook (&lt;a href="https://tra38.gitbooks.io/procedurally-generated-narratives/content/" rel="noopener noreferrer"&gt;"An Introductory Guide to Computer-Generated Works"&lt;/a&gt;) is hosted on GitBook. It provides links to all the articles I have written on computer-generated texts, as well as the full text of those articles if you'd like to use GitBook's UI or want to read these articles offline. You may download/star/subscribe to the eBook &lt;a href="https://www.gitbook.com/book/tra38/procedurally-generated-narratives/details" rel="noopener noreferrer"&gt;using this link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I hope you like it.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A Real-World Example Of "Duck Typing" In Ruby</title>
      <dc:creator>Tariq Ali</dc:creator>
      <pubDate>Tue, 21 Feb 2017 17:44:04 +0000</pubDate>
      <link>https://dev.to/tra/a-real-world-example-of-duck-typing-in-ruby</link>
      <guid>https://dev.to/tra/a-real-world-example-of-duck-typing-in-ruby</guid>
      <description>

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_first_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here, I defined a Ruby method that will take any arbitrary object and call the [] method on it.&lt;/p&gt;

&lt;p&gt;Now, normally you call the [] method on an array, so this method tends to work as expected with arrays.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;get_first_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#=&amp;gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;But it doesn't just handle arrays. In fact, it handles any arbitrary object that responds to the [] method.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Alphabet"&lt;/span&gt;
&lt;span class="n"&gt;get_first_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#=&amp;gt;"A"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;As long as the object responds to the [] method, that's good enough for me. And that is the basis of &lt;strong&gt;duck typing&lt;/strong&gt;. It is based on the principle that if an object 'quacks like a duck', it is therefore a duck. If it quacks like an array, it is an array. And if I pass in an object that doesn't support the [] method, we'd simply see an error raised, like so...&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;awesome_object&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;
&lt;span class="n"&gt;get_first_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;awesome_object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#=&amp;gt;NoMethodError: undefined method `[]' for #&amp;lt;Object:0x007f9a93135b70&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The error message is probably most helpful for the programmer who has access to the source code of &lt;code&gt;get_first_element&lt;/code&gt;, but at least the program halted at execution.&lt;/p&gt;

&lt;p&gt;And that's all I have for now. I'll leave you with these cool code snippets of me getting the first element of integers. Happy coding.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;integer = 1
get_first_element(integer)
#=&amp;gt;1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Uh...&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;get_first_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#=&amp;gt;0&lt;/span&gt;

&lt;span class="n"&gt;get_first_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#=&amp;gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Cancel the happy coding.&lt;/p&gt;

&lt;p&gt;Integers in Ruby &lt;em&gt;also&lt;/em&gt; respond to the [] method. &lt;a href="http://www.rubydoc.info/stdlib/core/Fixnum:%5B%5D"&gt;According to the docs&lt;/a&gt;, integers appear to have a binary representation "under the hood", and the [] method gets me the &lt;code&gt;n&lt;/code&gt;th digit of that binary representation. The docs give a pretty good example of what's going on...&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;#Define a as the integer 13098, by using its binary representation
a = 0b11001100101010

#Retrieve that same binary representation by using []
30.downto(0) { |n| print a[n] }
#=&amp;gt; 0000000000000000011001100101010
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;After all, all numbers must have a binary representation, and that binary representation has to be stored somewhere, and you might as well treat that stored binary representation as an array. It sounds like completely expected behavior so long as you know to expect it in advance.&lt;/p&gt;

&lt;p&gt;The duck has quacked, and so we assume it to be a duck. And it &lt;em&gt;is&lt;/em&gt; a duck...don't get me wrong. Just a duck that I was unfamiliar with.&lt;/p&gt;

&lt;p&gt;Coincidentally, I actually used &lt;code&gt;get_first_element&lt;/code&gt; in a personal side-project. I stayed up all night trying to track down weird bugs that occurred when I was passing integers into &lt;code&gt;get_first_element&lt;/code&gt;. Once I realized what was causing this, I added a quick hotfix to solve the issue...&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;def get_first_element(object)
    string = object.to_s
    string[0]
end
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Call the object's &lt;code&gt;to_s&lt;/code&gt; method, thereby creating a string representation of the object. Then, simply access the first element of that string. This meant that &lt;code&gt;get_first_element&lt;/code&gt; handled integers as I &lt;strong&gt;wanted&lt;/strong&gt; them to...&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;get_first_element(1)
#=&amp;gt;"1"

get_first_element(2)
#=&amp;gt;"2"

get_first_element(30)
#=&amp;gt;"3"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;...but now users will be surprised at how the method handles arrays.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;to_s&lt;/span&gt;
&lt;span class="c1"&gt;#=&amp;gt;"[1,2,3]"&lt;/span&gt;

&lt;span class="n"&gt;get_first_element&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;#=&amp;gt;"["&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And objects that would previously raise an error under the old &lt;code&gt;get_first_element&lt;/code&gt; would now produce a "valid" result under the new &lt;code&gt;get_first_element&lt;/code&gt;, if the objects are able to "quack" (respond to the &lt;code&gt;to_s&lt;/code&gt; method)...&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ruby"&gt;&lt;code&gt;awesome_object = Object.new

awesome_object.to_s
#=&amp;gt; "#&amp;lt;Object:0x007f9a9314f3e0&amp;gt;"

get_first_element(awesome_object)
#=&amp;gt;"#"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Luckily for me, the personal side-project only dealt with strings and integers. If my side-project ever had to support arrays or other arbitrary objects though (which is always a possibility, considering how often software changes), I would probably consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing a lot of ugly and overly-complex code to get &lt;code&gt;get_first_element&lt;/code&gt; to behave as I intended it to (formalizing my beliefs about how the method is "supposed" to work through automated tests).&lt;/li&gt;
&lt;li&gt;Writing documentation explaining and &lt;strong&gt;justifying&lt;/strong&gt; all the quirks. Saying, "This is how it works, deal with it" just won't fly.&lt;/li&gt;
&lt;li&gt;Eliminating &lt;code&gt;get_first_element&lt;/code&gt; entirely and rewriting the side-project.&lt;/li&gt;
&lt;/ul&gt;
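&lt;p&gt;The first option -- formalizing intended behavior -- might look something like this defensive rewrite (a sketch under my own assumptions about the intended semantics; the name &lt;code&gt;get_first_element_strict&lt;/code&gt; is hypothetical):&lt;/p&gt;

```ruby
# Sketch of option one: make the intended contract explicit instead of
# trusting any object that can quack. Unsupported types fail loudly.
def get_first_element_strict(value)
  case value
  when String  then value[0]
  when Integer then value.to_s[0]
  else
    raise TypeError, "only Strings and Integers are supported, got #{value.class}"
  end
end

get_first_element_strict("hello") #=> "h"
get_first_element_strict(42)      #=> "4"
```

&lt;p&gt;The trade-off is plain: the method no longer benefits from duck-typing at all, which is exactly why this option feels so ugly.&lt;/p&gt;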

&lt;p&gt;I dread having to take any of these approaches. If forced to choose though, I would pick "rewriting the side-project". My side-project was using &lt;code&gt;get_first_element&lt;/code&gt; to do some rather "hacky" stuff, and there's probably a much better (and less painful) way to do that same stuff. All I have to do is find it...and then implement it.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;This post is not an attack against duck-typing, although my reliance on duck-typing did lead to this error. Duck-typing is part of idiomatic Ruby, and taking advantage of it tends to lead to "cleaner" code. But there are always trade-offs to consider, such as the ducks performing behaviors you never intended simply because they knew how to quack. I never even experienced these trade-offs -- until now.&lt;/p&gt;

&lt;p&gt;This blog post was really an illustration of the &lt;a href="https://en.wikipedia.org/wiki/Peter_principle"&gt;Generalized Peter Principle&lt;/a&gt;: "Anything that works will be used in progressively more challenging applications until it fails." This principle is usually applied to human beings -- "Workers rise to their level of incompetence." Duck-typing is useful, and it was precisely because it was so useful that I spent the whole night debugging an issue caused by my use of it.&lt;/p&gt;

&lt;p&gt;I'm still going to use duck-typing. It's just too useful and convenient, and the odds of me encountering this issue in another Ruby side-project seem fairly low.&lt;/p&gt;

&lt;p&gt;But I'm going to be more cautious and careful when programming. More importantly, I plan to be more comfortable with my ignorance...never assuming that I know more than I actually do about the program I am writing, the patterns I am using when writing it, the language I am writing it in, and the requirements I am writing it for.&lt;/p&gt;

&lt;p&gt;Being comfortable with ignorance means that I might be able to anticipate and prepare for situations where the Generalized Peter Principle comes true and my tools break hard.&lt;/p&gt;


</description>
      <category>ducktyping</category>
      <category>ruby</category>
    </item>
    <item>
      <title>Will programming be automated? (A Slack Chat and Commentary)</title>
      <dc:creator>Tariq Ali</dc:creator>
      <pubDate>Fri, 03 Feb 2017 17:37:26 +0000</pubDate>
      <link>https://dev.to/tra/will-programming-be-automated-a-slack-chat</link>
      <guid>https://dev.to/tra/will-programming-be-automated-a-slack-chat</guid>
      <description>&lt;p&gt;&lt;a href="https://evansdata.com/press/viewRelease.php?pressID=231"&gt;Software Developers Worry They Will be Replaced by AI (Press Release at 03/08/2016)&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Developers fear that their own obsolescence will be spurred by artificial intelligence, according to a new survey of over 550 software developers to be released this week by Evans Data Corp. When asked to identify the most worrisome thing in their careers, the largest plurality, 29.1%, cited “I and my development efforts are replaced by artificial intelligence.” This was significantly more worrisome than platform issues, such as their targeted platforms becoming obsolete which was the second most cited worry at 23%, or that the new platform they're targeting doesn't catch on (14%).&lt;/p&gt;

&lt;p&gt;The thought of obsolescence due to AI was also more threatening than becoming old without a pension, being stifled at work by bad management, or by seeing their skills and tools become irrelevant. While the developers who worried about AI were found across industries and platforms, one strong correlation was that they were more likely to self-identify with being loners rather than team players.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;a href="https://devbootcamp.com"&gt;Dev Bootcamp&lt;/a&gt; is a 19-week accelerated coding program designed to teach people how to become web developers. Graduates to this program are then invited to an "alumni Slack", where they can talk with other alumni about their future career in programming. (Disclosure: I have graduated from this program and was invited to this chat.)&lt;/p&gt;

&lt;p&gt;On February 1st 2017, a discussion took place in the "alumni Slack" about the future of automation in software development, sparked by an off-hand comment in a separate conversation. I lightly edited the transcript below after getting permission from those who participated in this discussion. To protect the identities of the people involved, their names have been changed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;daphney.cronin&lt;/em&gt; - "Coding is a commodity these days and so while you do need to be competent, the thing that's going to win the offer is your ability to click with the interviewers and your genuine desire to be part of the team and contribute to the product."&lt;/p&gt;

&lt;p&gt;This statement makes me fear for the future of my current profession. It's true, of course, but commodification of skillsets will ultimately lead to automation of those same skillsets.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;adaline_stracke&lt;/em&gt; - I think the automation of development is pretty far off though. There are only so many things that a &lt;a href="https://en.wikipedia.org/wiki/WYSIWYG"&gt;WYSIWYG&lt;/a&gt; editor can do, and the code isn't very maintainable. The main challenge that I see in programming isn't writing code, it's writing code that is human readable because computers are dumb and AI hasn't progressed far enough to debug the way a human can.&lt;/p&gt;

&lt;p&gt;The WYSIWYG is a “for instance” BTW.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;kallie&lt;/em&gt; - I dunno, I think making a computer drive a car is a lot harder than making it write code.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;adaline_stracke&lt;/em&gt; - I disagree. Making a computer drive a car is a series of concrete decisions, whereas writing code is a series of decisions that have opinion behind them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;skittles_mcbangbang&lt;/em&gt; - I see things like 'serverless infrastructure' as a harbinger. Take a look at &lt;a href="https://aws.amazon.com/lambda/?hp=tile&amp;amp;so-exp=below"&gt;aws lambda&lt;/a&gt;: you put code on a webpage, and it gives you an endpoint to hit.&lt;/p&gt;

&lt;p&gt;How long before someone writes some sort of adapter that takes business logic and written text and turns it into behaviour? I mean, we already have testing frameworks that sort of do that (&lt;a href="https://github.com/cucumber/cucumber"&gt;cucumber&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;adaline_stracke&lt;/em&gt; - But that's okay. Because it takes someone who knows how to create the business logic. I can learn a new syntax fairly easily, but that's not what programming is.&lt;/p&gt;

&lt;p&gt;Actually, isn't that what &lt;a href="https://github.com/ruby/ruby"&gt;Ruby&lt;/a&gt; is? We write logic in Ruby and it compiles to C.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;kallie&lt;/em&gt; - Yes higher level language abstraction is already semi-automated code. It all becomes machine code in the end. There's no reason to believe this trend won't continue.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;skittles_mcbangbang&lt;/em&gt; - Non-programmers are also capable of logic, and one might argue have a better grasp of what the intended product is supposed to look like. And if you provide them a quick enough feedback loop between them writing an english sentence and seeing app behavior change...&lt;/p&gt;

&lt;p&gt;&lt;em&gt;kallie&lt;/em&gt; - Yeah that would arguably be better than your normal development pipeline. Hyper-rapid development and testing.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;adaline_stracke&lt;/em&gt; - I guess what I'm trying to say is that the leap from writing code as we do now to full natural language processing and translation into behavior is very big.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;kallie&lt;/em&gt; - It's big but I don't think it's as insurmountable as many people would assume it to be.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;adaline_stracke&lt;/em&gt; - Agreed.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;skittles_mcbangbang&lt;/em&gt; - Agreed. We might not need the natural language processing bit either. It wouldn't be hard to train people to write their english in a certain structure. It would just be part of the job training.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/cucumber/cucumber/wiki/Gherkin"&gt;https://github.com/cucumber/cucumber/wiki/Gherkin&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 1: Feature: Some terse yet descriptive text of what is desired
 2:   Textual description of the business value of this feature
 3:   Business rules that govern the scope of the feature
 4:   Any additional information that will make the feature easier to understand
 5:
 6:   Scenario: Some determinable business situation
 7:     Given some precondition
 8:       And some other precondition
 9:     When some action by the actor
10:       And some other action
11:       And yet another action
12:     Then some testable outcome is achieved
13:       And something else we can check happens too
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;adaline_stracke&lt;/em&gt; - Where I work, we tried doing something similar for testing, but it wasn't successful.&lt;/p&gt;

&lt;p&gt;I guess I fail to see the difference between teaching someone to express logic in one structure (like a programming language) vs another (such as what you pasted there). Fundamentally, it comes down to specific training which is what was talked about above.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;skittles_mcbangbang&lt;/em&gt; - It's extremely helpful in microservice envs, IME. It was an easy way of abstracting the several different calls by simply stating the prerequisites needed for the actual tests to run.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;daphney.cronin&lt;/em&gt; - The difference is that another human (with no training of how to express logic in that formal fashion) can read the specs and understand exactly what is occurring.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;adaline_stracke&lt;/em&gt; - @daphney.cronin, I get the idea of readability, but we were talking about the automation of WRITING code.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;daphney.cronin&lt;/em&gt; - Yeah, this type of approach really wouldn't help in automating code (since you're replacing one type of formalization with another). A corporation that wants to automate code would be better off trying to improve code generators (like &lt;a href="https://github.com/rails/rails"&gt;Rails&lt;/a&gt;) or CMSes (like &lt;a href="https://github.com/sferik/rails_admin"&gt;RailsAdmin&lt;/a&gt;, &lt;a href="https://github.com/activeadmin/activeadmin"&gt;ActiveAdmin&lt;/a&gt;, or &lt;a href="https://github.com/WordPress/WordPress"&gt;Wordpress&lt;/a&gt;)...&lt;/p&gt;

&lt;p&gt;&lt;em&gt;skittles_mcbangbang&lt;/em&gt; - That doesn't remove the need for a dev though, since in using those frameworks, you're still creating and maintaining servers. If you can type gherkin into something like Lambda or &lt;a href="https://github.com/openwhisk/openwhisk"&gt;OpenWhisk&lt;/a&gt;, then you start to get closer to replacing developers.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;adaline_stracke&lt;/em&gt; - I think that's the real rub here. Until AI is as good at debugging and maintaining as humans are, there will still be needs for devs.&lt;/p&gt;

&lt;p&gt;I think it's the same idea as self-replicating robots. Until robots are good enough to fix and rebuild themselves we need people to maintain them. Until software is good enough to fix and write itself, we will still need people to fix and write software.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;skittles_mcbangbang&lt;/em&gt; - No doubt, but it's still possible to replace a LOT of devs with the scenario above. You still need people to build and maintain those types of infrastructures, but waay less.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Commentary - Computer scientists &lt;em&gt;have&lt;/em&gt; conducted research into the automation of bug fixes -- the most promising program at the moment is &lt;a href="http://dijkstra.cs.virginia.edu/genprog/"&gt;GenProg&lt;/a&gt;, a program that uses genetic programming to mutate existing codebases to fix bugs. According to &lt;a href="http://ubiquity.acm.org/article.cfm?id=2746519"&gt;an interview with its developers&lt;/a&gt;, GenProg is very cheap ($8/hour) and speedy, but the computer-generated code is not as maintainable as human-generated code. GenProg is also heavily dependent on the specifications to determine whether it has fixed the bugs, so software engineers using GenProg would switch from writing code to writing tests. The "formalization" of business logic would still remain in human hands.&lt;/p&gt;
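&lt;p&gt;The "generate and validate" idea behind tools like GenProg can be illustrated with a toy Ruby sketch (this is not GenProg's actual algorithm, which mutates the ASTs of real programs; everything here is a simplified assumption): randomly mutate a broken expression and keep the first mutant that passes the whole test suite.&lt;/p&gt;

```ruby
# Toy "mutate until the tests pass" repair loop. The specification lives
# entirely in the tests, mirroring how GenProg relies on its test suite.
OPERATORS = %w[+ - *]

buggy_source = "a - b"  # intended to add two numbers, but subtracts

tests = [
  { a: 1, b: 2, expected: 3 },
  { a: 5, b: 5, expected: 10 }
]

passes_all = lambda do |code|
  tests.all? do |t|
    a = t[:a]
    b = t[:b]
    eval(code) == t[:expected]  # run the candidate against this test case
  end
end

candidate = buggy_source
until passes_all.call(candidate)
  # "Mutation": swap the operator for a randomly chosen one.
  candidate = buggy_source.sub(/[-+*]/, OPERATORS.sample)
end

puts candidate  # prints the repaired expression: "a + b"
```

&lt;p&gt;Only the tests define "fixed" here, which is the interview's point: engineers using such a tool shift from writing code to writing specifications.&lt;/p&gt;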

&lt;p&gt;The discussion assumed that the needs for software will stay constant. This may not be true -- instead, software is increasingly required to become more complex and to handle more. This leads to the &lt;a href="http://www.asktog.com/columns/011complexity.html"&gt;Complexity Paradox&lt;/a&gt;, based on Tog's Law of Commuting:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The time of a commute is fixed. Only the distance is variable." Translation? People will strive to experience an equal or increasing level of complexity in their lives no matter what is done to reduce it. Make the roads faster, and people will invariably move further away.&lt;/p&gt;

&lt;p&gt;... Given that people will continue to want the same level of complexity in their lives, given that we will continue to reduce the proportion of complexity of any given function that we expose to the user, we may expect that the difficulty and complexity of our own tasks, be they at the application or OS level, will only increase over time. That has certainly been the case so far--we've gone from simple memo writers and sketchpads to document processors and PhotoShop. And we may assume that's only the beginning.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So when we build higher-level languages, code generators, and CMSes, we merely encourage users of those tools to do more "work" with them. This increases the complexity of the resulting software, which requires more maintenance work. Software is constantly asked to change and upgrade, and the field must keep up with that demand.&lt;/p&gt;

&lt;p&gt;This would seem to suggest that programmers would have high job security (and Tog makes that very assertion at the end of his essay). However, the goal of all automation is cost savings and greater efficiency. There are cost savings involved when you fire programmers and replace them with cheaper "specs writers". And as software becomes more complex and convoluted, we will probably rely less on very-fallible humans and more on less-fallible machines.&lt;/p&gt;

&lt;p&gt;Therefore, it is safe to say that many developers will still remain afraid of automation/AI for quite some time, and so the debate will still rage on...&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why I Switch From [Language_1] to [Language_2]</title>
      <dc:creator>Tariq Ali</dc:creator>
      <pubDate>Fri, 27 Jan 2017 16:27:43 +0000</pubDate>
      <link>https://dev.to/tra/why-i-switch-from-language1-to-language2</link>
      <guid>https://dev.to/tra/why-i-switch-from-language1-to-language2</guid>
      <description>&lt;p&gt;I am a big fan of [Language_1] and one of its early adopters, having been disappointed with the utter failures of [Language_0]. I have been an avid contributor to many open source projects such as [Obscure_Project_1], [Obscure_Project_2], and [Obscure_Project_3]. However, after using [Language_1] for over 5 years, I have been dealing with [Minor_Technical_Flaws]. At first, I ignored and even tolerated these flaws, but I was forced to confront reality. I could not live with these flaws, and since [Language_1] is a mature language, it will be difficult, if not impossible, to actually fix [Minor_Technical_Flaws].&lt;/p&gt;

&lt;p&gt;I also resented the opinions that [Language_1] took. When I first used the language, I thought those opinions were a breath of fresh air. I would constantly defend those opinions on Hacker News, because I thought that these opinions would promote good coding practices and would ultimately lead to higher productivity. After 5 years of coding though, I realized that those opinions may have been &lt;em&gt;slightly flawed&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;And then, I heard of [Language_2]. [Language_2] promised to fix all the technical flaws in [Language_1]. I also read about the opinions inherent in [Language_2], and realized that those opinions were utterly correct, compared to the fallacious and wrong-headed opinions in [Language_1]. And that's why I switched languages.&lt;/p&gt;

&lt;p&gt;Let me show you an example to prove my point.&lt;/p&gt;

&lt;p&gt;Here is an example of a “Hello World” program in [Language_1]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PROGRAM HELLOWORLD
 10 FORMAT (1X,11HHELLO WORLD)
 WRITE(6,10) 
 END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see why I was drawn to this language. Its syntactic sugar was beautiful. But it was too &lt;em&gt;magical&lt;/em&gt;. I didn't know what was actually going on. What did &lt;em&gt;Format&lt;/em&gt; mean? Or &lt;em&gt;Write&lt;/em&gt;? When I had to scale up my program, I would be cursing all night trying to debug the latest ‘magic’ that [Language_1] decided to invoke on me. Half the time, I would be fighting against [Language_1]'s constraining limits.&lt;/p&gt;

&lt;p&gt;And here's an example of “Hello World” in [Language_2]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;+++++++++[&amp;lt;++++++++&amp;gt;-]&amp;lt;.&amp;gt;+++++++[&amp;lt;++++&amp;gt;-]&amp;lt;+.+++++++..+++. [-]&amp;gt;++++++++[&amp;lt;++++&amp;gt;-]
&amp;lt;.&amp;gt;+++++++++++[&amp;lt;+++++&amp;gt;-]&amp;lt;. &amp;gt;++++++++[&amp;lt;+++&amp;gt;-]&amp;lt;.+++.------.--------.[-]&amp;gt;++++++++
[&amp;lt;++++&amp;gt;-]&amp;lt;+. [-]++++++++++.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see how easily readable and maintainable [Language_2] is over [Language_1]. Its syntax is just as elegant as [Language_1], but there's no magic involved. Everything that I need to know is right there, ready to be understood in an easy-to-digest fashion. There's no fighting with the language here, just me working with my trusty new tool.&lt;/p&gt;

&lt;p&gt;Always use the best tool for the job. That's why I have chosen to use [Language_2] for all my obscure side-projects. It will likely not have any technical flaws whatsoever, nor will its opinions ever turn out to be wrong. And, in the rare case that [Language_2] disappoints me…well, there's always [Language_3]…&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: This article was previously &lt;a href="https://medium.com/@igorhorst/why-i-left-language-1-for-language-2-1d2fa418cd4c#.5eu7rpg7t"&gt;published on Medium&lt;/a&gt; under a pseudonym. After publication, &lt;a href="https://github.com/qiajigou"&gt;qiajigou&lt;/a&gt; used this template to create &lt;a href="https://github.com/qiajigou/why-I-left-language1-for-language2"&gt;an open-source project&lt;/a&gt; allowing you to generate your own "language switching" rant.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Untold Problems of a Software Blog - Bots</title>
      <dc:creator>Tariq Ali</dc:creator>
      <pubDate>Sat, 07 Jan 2017 18:45:59 +0000</pubDate>
      <link>https://dev.to/tra/untold-problems-of-a-software-blog---bots</link>
      <guid>https://dev.to/tra/untold-problems-of-a-software-blog---bots</guid>
      <description>&lt;p&gt;&lt;a href="https://www.incapsula.com/blog/bot-traffic-report-2015.html"&gt;According to a 2015 report by Incapsula&lt;/a&gt;, 48.5% of all web traffic are by bots. If you are running a &lt;em&gt;small&lt;/em&gt; website (10 to 1000 visits per day), 85.4% of your web traffic are by bots. Even if you happen to run a very large website (100,000 to 1 million visits per day), 39.7% of your web traffic are by bots.&lt;/p&gt;

&lt;p&gt;To put it frankly, most of your audience is composed of robots. (I learned this the hard way when I was maintaining a website for a Texan religious organization and found out from Google Analytics that 90% of the people visiting the site were from Russia.)&lt;/p&gt;

&lt;p&gt;A minority of bots are "good bots" - crawlers and web scanner tools. Their goal is to visit websites and provide valuable information to another user: search engines aggregating results based on keywords, chatbots looking to answer users' natural language queries, etc. &lt;a href="https://support.google.com/webmasters/answer/182072?hl=en"&gt;Googlebot&lt;/a&gt; is a prime example of a "good bot" (and is actually pictured in the header of this blog post holding flowers).&lt;/p&gt;

&lt;p&gt;The majority of bots are "bad bots" - scrapers that are harvesting emails and looking for content to steal, DDoS bots, hacking tools that are scanning websites for security vulnerabilities, spammers trying to sell the latest diet pill, ad bots that are clicking on your advertisements, etc. (Okay, the ad bots are probably "good" from the perspective of the blogger with ads, but they're still pretty bad for &lt;a href="http://www.adweek.com/news/advertising-branding/whats-being-done-rein-7-billion-ad-fraud-169743"&gt;the companies paying for the ads&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;I delivered a speech about bots in June 2016 - &lt;a href="https://docs.google.com/presentation/d/1gK6KmL_dtwk_8477Jc6GeLAipZMLoMVxFa9TIffggcQ/edit#slide=id.g142d681b2f_0_187"&gt;here are the slides for it&lt;/a&gt;. But I am writing this blog post as a response to &lt;a href="https://dev.to/liquidise/untold-benefits-of-a-software-blog"&gt;Untold Benefits of a Software Blog&lt;/a&gt;. The arguments in that blog post are correct. But any discussion of blogging must take into account the audience of that blog, and for most websites, that audience will be machines. Here are the problems that bots can cause:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your Metrics Are Unreliable&lt;/strong&gt; - You can hire a botnet to visit your website, scroll up and down the page, execute JavaScript to trigger your analytics software, click on ads and other links, write comments about your blog post, like your blog posts, retweet your blog posts, etc., etc. Even if you don't want to fake your metrics to boost your vanity, bots may still behave as normal users to avoid detection by other algorithms. This may mean that they will visit multiple "innocent" sites (including yours) before heading over to their final destination. Bad analytics make it difficult to target your content properly, since how else would you know if your content is actually working on human beings?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You Are Writing For the Machine&lt;/strong&gt; - There is just so much content on the Internet that humans are unable to consume it all. So they rely on aggregators and filters. But content producers then realize that writing high-quality content is useless, because nobody will discover it without the help of aggregators and filters. Instead, content producers should write content that appeals to the aggregators and filters themselves. This is the ultimate premise behind SEO (search engine optimization), and the end conclusion is people writing articles with long-tail keywords to secure visits from search engines. SEO is the natural consequence of &lt;a href="https://en.wikipedia.org/wiki/Goodhart's_law"&gt;Goodhart's Law&lt;/a&gt; ("When a measure becomes a target, it ceases to be a good measure"). Search engines' dependence on links to determine whether websites are popular led to the rise of &lt;a href="https://en.wikipedia.org/wiki/Link_farm#History"&gt;"link farms"&lt;/a&gt;. Google was also responsible for the rise and fall of &lt;a href="https://en.wikipedia.org/wiki/Content_farm"&gt;content farms&lt;/a&gt; such as &lt;a href="https://en.wikipedia.org/wiki/EHow"&gt;eHow&lt;/a&gt; - Google's algorithms once prized such "fresh content", only to change their mind later with &lt;a href="https://en.wikipedia.org/wiki/Google_Panda"&gt;the "Panda" updates&lt;/a&gt;. Even today, Google &lt;a href="http://relevance.com/how-google-ruined-content-marketing/"&gt;encourages "quantity over quality" in content generation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;While one high-quality article might drive a thousand shares, 10 articles that drive 120 shares each is more. Replace shares with traffic or conversions. It's the same concept. In this way, Google is actually encouraging us to commoditize our content in lieu of creating great content, whether it's purposeful or not.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://dev.to/tra/nanogenmo-2016-and-my-predictions-about-text-generation/comments/3i"&gt;This perverse trend is ultimately unsustainable&lt;/a&gt;, but it may still continue for a while longer. Marketers are already discussing &lt;a href="http://buzzsumo.com/blog/future-lot-content/"&gt;how the Washington Post is able to quickly generate more content by using algorithms&lt;/a&gt;, and how incumbents could drown out  competition from upstarts &lt;a href="https://www.businessesgrow.com/2016/09/19/crappy-content/"&gt;by mass-producing low-quality content&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your Content Will Be Divorced From You&lt;/strong&gt; - Content on websites such as dev.to is reposted elsewhere, word-for-word, by scrapers programmed by Black Hat SEO specialists. These specialists hope to populate their websites with cheap content...and what could be cheaper than copying and pasting? If you're lucky, they're willing to give you a backlink as a meaningless "thank-you note" for providing the content that they so desperately need. Most of the time, they won't bother. Even "legitimate" actors such as Google are getting into the business of taking your content and reusing it, through the use of &lt;a href="https://support.google.com/webmasters/answer/6229325?hl=en"&gt;featured snippets&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, a new breed of scrapers exists - intelligent scrapers. They can search websites for sentences containing certain keywords, and then rewrite those sentences using "article spinning" techniques. As much as I find it distasteful to link to Black Hat SEO sites, here is &lt;a href="https://www.youtube.com/watch?v=sV8KvC420SM&amp;amp;t=13s"&gt;a YouTube demonstration of Kontent Machine 3&lt;/a&gt;, one of these intelligent scrapers (so you can observe how it works). The output is decent enough to please a search engine algorithm, but still needs work before humans could also enjoy it. (Disclaimer: I am working on 'content aggregation' software that can take paragraphs from an external CSV file and then rearrange them into unique articles. I am still thinking of a good way to open-source it without helping intelligent scrapers.)&lt;/p&gt;

&lt;p&gt;I think it is likely that as text generation techniques get better and as AI becomes more advanced, people will begin recycling content...downplaying or even eliminating the content creator from the picture. The machine curator is praised while the human content creator is devalued or ignored.&lt;/p&gt;

&lt;p&gt;I gave a more positive spin on this trend in the article &lt;a href="https://dev.to/tra/who-are-the-audiences-of-computer-generated-novels"&gt;Who are the Audiences of Computer-Generated Novels?&lt;/a&gt;, in the section entitled "People Who Want To Consume A Existing Corpus". After all, there's just so much data out there that we do need algorithms to make sense of it all, and the curator does deserve some "credit" for its role in bringing order from chaos. And these algorithms are exciting. But exciting algorithms carry negative consequences as well -- consequences that we have to be aware of and prepare for.&lt;/p&gt;

&lt;p&gt;Imagine a chatbot that resides in your terminal. You can ask this chatbot a coding question, without ever going to StackOverflow and seeing a person's name or (shudder) avatar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#&amp;gt; howdoi python for loop
#=&amp;gt;
#for idx, val in enumerate(ints):
#    print(idx, val)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That world already exists. It's called &lt;a href="https://github.com/gleitz/howdoi"&gt;howdoi&lt;/a&gt;, an open-source command-line tool. Of course, it's not perfect (neither is StackOverflow), but what surprised me the most about howdoi is that it can return qualitative answers as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#&amp;gt;howdoi evolutionary algorithm
#&amp;gt;The pro: Evolutionary algorithms (EAs), as genetic algorithms (GAs), are general 
# purpose optimization algorithms that can be applied to any problem for which you
# can define a fitness function. They have been applied to nearly all conceivable
# optimization problems, see e.g. the conference series on „Parallel Problem
# Solving from Nature“. The con: EAs are generally much slower than any specific
# optimization algorithm adapted to the problem. I would apply them only if all
# standard optimization algorithms fail.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#&amp;gt;howdoi solve the halting problem&lt;/span&gt;
&lt;span class="c"&gt;#&amp;gt;The normal approach is to constrain program behavior to an effectively calculable &lt;/span&gt;
&lt;span class="c"&gt;#&amp;gt; algorithm.  For example, the simply typed lambda calculus can be used to &lt;/span&gt;
&lt;span class="c"&gt;#&amp;gt; determine that an algorithm always halts.  This means that the simply typed &lt;/span&gt;
&lt;span class="c"&gt;#&amp;gt; lambda calculus is not Turing complete, but it is still powerful enough to &lt;/span&gt;
&lt;span class="c"&gt;#&amp;gt; represent many interesting algorithms.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Humans wrote this content. But humans are being denied credit for their effort, as it is the algorithm ("howdoi") that is curating this content and showing it to the end user. Note the lack of backlinks in the outputs of "howdoi" as well, meaning "howdoi" is technically engaging in plagiarism. "howdoi" is the future, and the future doesn't look good for human content generators.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;: I am not &lt;em&gt;necessarily&lt;/em&gt; saying "Don't bother writing blog posts because the only people who are going to read them are bots". There are good reasons to write a blog post. Just know that humans are not your primary audience for these posts. If you treat blogs mostly as vehicles of self-expression and communication with other people within a niche, then you'll go far. If you treat blogs as a ticket to gaining fame with other human beings though, you're probably going to fail hard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correction&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;While it is true that howdoi is currently engaging in plagiarism, &lt;a href="https://github.com/gleitz/howdoi/issues/152"&gt;the developers are aware of this&lt;/a&gt; and &lt;a href="https://github.com/gleitz/howdoi/pull/153"&gt;a PR has been opened to fix this issue&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>NaNoGenMo 2016 and My Predictions About Text Generation</title>
      <dc:creator>Tariq Ali</dc:creator>
      <pubDate>Fri, 23 Dec 2016 05:39:01 +0000</pubDate>
      <link>https://dev.to/tra/nanogenmo-2016-and-my-predictions-about-text-generation</link>
      <guid>https://dev.to/tra/nanogenmo-2016-and-my-predictions-about-text-generation</guid>
      <description>&lt;h1&gt;
  
  
  NaNoGenMo 2016
&lt;/h1&gt;

&lt;p&gt;NaNoGenMo stands for National Novel Generation Month. Can you write code that can generate a 50,000-word novel in one month? Since it started in 2013, it has slowly gained a reputation for innovation in tackling a very hard problem (generating text at scale) in a variety of different ways.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/NaNoGenMo/2016"&gt;NaNoGenMo 2016&lt;/a&gt; appeared to be less active than &lt;a href="https://github.com/dariusk/NaNoGenMo-2015"&gt;NaNoGenMo 2015&lt;/a&gt; or &lt;a href="https://github.com/dariusk/NaNoGenMo-2014"&gt;NaNoGenMo 2014&lt;/a&gt;. It had less issues and less stars overall. Media attention, once widespread and viral, has dropped off significantly - the only news article of note was &lt;a href="http://www.nerdcore.de/2016/11/30/%E2%96%88-blackout-%E2%96%88-generative-censorship-poetry/"&gt;Nerdcore&lt;/a&gt;'s acknowledgement of the existence of the generative novel &lt;a href="https://github.com/NaNoGenMo/2016/issues/119"&gt;The Days Left Foreboding and Water&lt;/a&gt;. Coverage appears to be limited to sympathetic sources (usually from other programmers).&lt;/p&gt;

&lt;p&gt;However, a sea change in attitudes appears to be occurring beneath the surface of low activity. Knowledge about generative text has been formalized in blog posts (such as the ones that I have written before on dev.to, but also &lt;a href="https://habrahabr.ru/post/313862/"&gt;a Russian-language blog post&lt;/a&gt;) for easy reference. More newbies seemed eager to participate. Computer science instructors are giving their students &lt;a href="https://github.com/NaNoGenMo/2016/issues/104"&gt;extra&lt;/a&gt; &lt;a href="https://github.com/NaNoGenMo/2016/issues/150"&gt;credit&lt;/a&gt; if they &lt;a href="https://github.com/NaNoGenMo/2016/issues/115"&gt;submit&lt;/a&gt; an entry. One class made it a required assignment to &lt;a href="https://github.com/NaNoGenMo/2016/issues/148"&gt;submit a generative book&lt;/a&gt;. The computer-generated prose is now much more readable and interesting, though &lt;a href="https://github.com/NaNoGenMo/2016/issues/24#issuecomment-265933771"&gt;more work is still necessary&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The field of text generation is consolidating, and possibly improving for the better. Which leads to a question...what happens next in the field? Looking at what happened in NaNoGenMo 2016, I am willing to venture three predictions...&lt;/p&gt;

&lt;h1&gt;
  
  
  Statistical Approaches Will Be Part Of A 'Toolkit' In Text Generation
&lt;/h1&gt;

&lt;p&gt;During NaNoGenMo 2016, Vincent Toups posted a &lt;a href="https://twitter.com/i/moments/809240284664041474"&gt;Tweetstorm&lt;/a&gt; examining two major approaches used in text generation - the "statistical" approach and the "structural modeling" approach. I've focused a lot on "structural modeling"...but less so on "statistical" approaches (e.g., machine learning).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;["The statistical" approach] is an attempt to create an ad-hoc generative model w/ a data set. The generative approach is broad: recurrent neural networks or word vectors or markov models are tools used in this approach.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Vincent Toups points out two issues associated with the "statistical" approach:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The statistics of a given novel are very bad. Most novels contain a series of distinct sentences. Even for a loose definition of "distinct". As such, if you are trying to learn a model for sentences, you don't have enough data.&lt;/p&gt;

&lt;p&gt;... The second major issue for naive models is structure: novels are not sequences of sentences. They encode complex, large, correlations. Most notably, plot. But style, characterization, etc all live at scales which naive models have trouble characterizing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Statistical approaches are good for presenting text that &lt;em&gt;looks&lt;/em&gt; human-ish. They are very effective at producing evocative words, phrases, and sentences. However, the generated text quickly turns into gibberish with even short-term exposure, and it cannot "scale" upwards to novel-length works.&lt;/p&gt;
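&lt;p&gt;As a toy illustration of this statistical approach (not any entrant's actual code), here is a minimal word-level Markov chain in Python: each word maps to the words observed to follow it, and generation is a random walk over those observations. The one-sentence corpus is invented for illustration.&lt;/p&gt;

```python
import random
from collections import defaultdict

def build_chain(text):
    # Map each word to the list of words observed to follow it.
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length=12):
    # Random walk: repeatedly pick one of the observed successors.
    output = [start]
    for _ in range(length - 1):
        successors = chain.get(output[-1])
        if not successors:
            break
        output.append(random.choice(successors))
    return " ".join(output)

corpus = "the cat sat on the mat and the dog sat on the rug"
chain = build_chain(corpus)
print(generate(chain, "the"))
```

&lt;p&gt;With this little data, the walk loops and degrades into gibberish almost immediately - the scaling problem in miniature.&lt;/p&gt;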

&lt;p&gt;The previous blog posts I have written about "text generation" deal with what Vincent Toups calls "structural modeling". The idea with modeling is that the programmer writes code that 'encodes' the plot/style/characterization at a large scale. Even these models have their weaknesses and faults, but they do appear to produce far more readable text than "statistical" approaches. As Vincent points out -&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[They] creep towards non-generated works, since the author is very present in them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While Vincent points out the limits of both the "statistical" approach and the "structural modeling" approach, he also expresses hope for a fusion between the two:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I think a genuinely successful attempt will involve hybrid approaches: models for story and arc, statistical techniques for style. Even then, I suspect the results will lack a human feel. Literature, after all, is a complex communicative act between humans.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In 2016, a few programmers began moving towards this hybrid approach, using "statistical approaches" and "structural modeling" together to generate novels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Isaac Karth used &lt;a href="https://github.com/chartbeat-labs/textacy"&gt;Textacy&lt;/a&gt;, a Python module intended for high-level 'natural language processing', to search within a Gutenberg corpus for key sentences that can be combined into templates. These templates would then be used to generate scenes for his &lt;a href="https://github.com/NaNoGenMo/2016/issues/6"&gt;Pirate Novel&lt;/a&gt;. Isaac, however, admitted to also using 'grep' to find key sentences.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/NaNoGenMo/2016/issues/24"&gt;Superphreak&lt;/a&gt; uses sentence similiarty to generate 'workplace gripes'...a sentence would be randomly chosen from a corpus, along with a few similar sentences. This produces a coherent mini-rant, as each sentence in the mini-rant relate to each other.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/NaNoGenMo/2016/issues/45"&gt;Fire Up The Pulp Mill!&lt;/a&gt; generates short stories set in fantasy universes. The algorithm uses Markov chains to generate the name of various locations that characters can explore, and uses templates and combat simulations to explain what the characters actually &lt;em&gt;do&lt;/em&gt; at those locations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/NaNoGenMo/2016/issues/114"&gt;Annales&lt;/a&gt; is a novel about the history of a mighty country. &lt;a href="http://mikelynch.org/2016/Dec/10/annales-1-vocab/"&gt;It uses a Neural Network&lt;/a&gt; to generate "not just names but creatures, adjectives, abstract nouns, weapons, food and drink, and so on". The generated words were filtered to match certain criteria (for example - the name of all women must end with the letter 'a'). According to Mike Lynch, using neural networks to generate words "made the output more mysterious and got rid of a lot of the jokey anachronism of early drafts".&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is likely that this "hybrid approach" will be fruitful in the coming years. By taking advantage of both modeling and statistical techniques, the programmer gets the 'best of both worlds'.&lt;/p&gt;
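&lt;p&gt;A toy sketch of this hybrid idea (invented for illustration, not taken from any entry above): a hand-written template supplies the structure of a scene, while a character-level Markov chain, trained on a tiny made-up list of seed names, supplies evocative place names - roughly the division of labor used by "Fire Up The Pulp Mill!".&lt;/p&gt;

```python
import random

# Statistical half: a character-level Markov chain invents place names.
# SEED_NAMES is a tiny, invented stand-in for a real name corpus.
SEED_NAMES = ["Eldoria", "Thalmark", "Vardenfell", "Ironmoor", "Karthage"]

def build_name_chain(names):
    chain = {}
    for name in names:
        padded = "^" + name.lower() + "$"  # ^ marks start, $ marks end
        for a, b in zip(padded, padded[1:]):
            chain.setdefault(a, []).append(b)
    return chain

def invent_name(chain, max_len=10):
    ch, letters = "^", []
    for _ in range(max_len):
        ch = random.choice(chain[ch])
        if ch == "$":
            break
        letters.append(ch)
    return "".join(letters).capitalize()

# Structural half: a hand-written template encodes the scene's shape.
TEMPLATE = "The heroes marched from {a} to {b}, and at {b} they fought the beast."

def generate_scene(chain):
    return TEMPLATE.format(a=invent_name(chain), b=invent_name(chain))

chain = build_name_chain(SEED_NAMES)
print(generate_scene(chain))
```

&lt;p&gt;The template keeps the scene readable no matter what the chain produces, while the generated names keep the output from feeling hand-authored.&lt;/p&gt;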

&lt;h1&gt;
  
  
  Corpus Curation Will Be A Necessary Skill In 'Text Generation'
&lt;/h1&gt;

&lt;p&gt;Can a human write a story without any knowledge of their world whatsoever? With no concepts, ideas, even language? I doubt that it is possible. Humans require societal input to be able to generate meaningful output. Machines require input as well, in the form of a corpus.&lt;/p&gt;

&lt;p&gt;It is not a coincidence that the current hype in AI coincides with the rise of Big Data. Machines are able to process and manipulate data much more effectively than a human can, thereby making them suitable for generating useful statistical predictions and analyses about the world (even as one concedes the machines don't really "understand" the world).&lt;/p&gt;

&lt;p&gt;The text-generation equivalent of data is the 'corpus' - written text that is fed into a computer. The computer can stitch together text from a corpus to generate a story, just like a human uses letters from a language to generate a story. A corpus is required for text generation, even if the corpus is as basic as &lt;a href="https://dev.to/tra"&gt;templates&lt;/a&gt;. Even &lt;a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/"&gt;machine learning algorithms such as RNNs&lt;/a&gt; require some kind of pre-existing corpus to train on. A program requires a corpus to write effectively (just like humans require pre-existing knowledge to write effectively).&lt;/p&gt;

&lt;p&gt;The upshot is that the programmer must also engage in "corpus curation". The programmer must search for text that can then be used by the algorithm to generate stories. The process of finding a corpus can be difficult and time-consuming, so many NaNoGenMo programmers rely on &lt;a href="http://www.gutenberg.org"&gt;Gutenberg&lt;/a&gt; and &lt;a href="https://github.com/dariusk/corpora"&gt;corpora&lt;/a&gt; for their text. If a specific corpus is not easily available, then the programmer must handwrite the corpus, a rather painful process.&lt;/p&gt;
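&lt;p&gt;As a practical note, the lists in the corpora repository are plain JSON documents with a descriptive top-level key, so consuming one takes only a few lines. The sample document below mimics that layout; its contents are invented for illustration.&lt;/p&gt;

```python
import json

# The corpora repository stores each word list as a JSON document with a
# descriptive top-level key. SAMPLE mimics that layout; its contents are
# invented for illustration.
SAMPLE = '{"description": "Common moods.", "moods": ["happy", "wistful", "furious"]}'

def load_corpus(raw_json, key):
    # Parse the document and pull out one named list.
    return json.loads(raw_json)[key]

print(load_corpus(SAMPLE, "moods"))
```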

&lt;p&gt;Generally, as the corpus grows larger, the generated output becomes more varied and interesting. This means that the best way to improve a text generator is to simply gather more data. Of course, the process of gathering data is itself an expensive endeavour, forcing programmers to be more creative in how they use the corpus they already have.&lt;/p&gt;

&lt;p&gt;Examples From 2016 NaNoGenMo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/NaNoGenMo/2016/issues/24"&gt;Superphreak&lt;/a&gt; is a novel where the main character dives into dumpsters to find valuable newspaper clippings and technical manuals about his world. Whenever a dumpster dive is successful, new pieces of the corpus gets shown to the user.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/NaNoGenMo/2016/issues/15"&gt;The Track Method&lt;/a&gt; places passages from Gutenberg right next to each other to generate a human-readable 13,606-word novella about a secret agent living in an underground complex.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/NaNoGenMo/2016/issues/58"&gt;If on a winter's night a library cardholder&lt;/a&gt; is a story about a person trying to find a specific book from the the Gutenberg corpus. Whenever the main character opens up a book, you see a random passage straight from that book, before the character decides whether to keep the book or return it to the shelves.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Further Reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://groups.google.com/forum/#!topic/generativetext/TiigaPYxe7U"&gt;Should the preparation of the corpus count as novel preparation for NaNoGenMo?&lt;/a&gt; discusses the role of the corpus in NaNoGenMo and explain why corpus reuse is useful for programmers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  The Goal Of The Programmer Will Be To 'Scale' Novel Experiences, &lt;em&gt;Not&lt;/em&gt; To Save Money
&lt;/h1&gt;

&lt;p&gt;The traditional narrative about automation is that machines are able to do the same job that a human can, but at a much cheaper rate. As a result, the human is likely to lose employment since the automation is more productive.&lt;/p&gt;

&lt;p&gt;While I subscribe to this traditional narrative for most industries, it is a bit harder to apply it to generative text. Writers' wages are already very low to begin with (and nonexistent if you include volunteer bloggers and social media users). Any efficiency gains from switching from cheap human labor to cheap automated labor must be limited...and that is if you ignore the initial setup costs. Cost-savings may exist in some form, but they will not be why people use computers to write.&lt;/p&gt;

&lt;p&gt;Instead, computers have the advantage of 'scaling' up interesting ideas. A human may balk at the idea of writing personalized sonnets for a thousand people at a moment's notice, generating dialogue trees for hundreds of different characters in a cyberpunk video game, or &lt;a href="https://github.com/NaNoGenMo/2016/issues/97"&gt;writing twelve horrible James Bond puns&lt;/a&gt;. But a computer doesn't care. &lt;a href="https://dev.to/tra/defining-the-industry"&gt;Code is just a way of formalizing processes&lt;/a&gt;, and the computer enables a programmer to turn a creative idea into a physical, final product.&lt;/p&gt;

&lt;p&gt;There's a limit to how far computers can 'scale' up those interesting ideas though. It is very easy to produce infinite variation, but humans are good at pattern matching and will easily discover the patterns in your generation. Once humans discover the patterns, they will likely get bored. &lt;a href="https://en.wikipedia.org/wiki/The_Seven_Basic_Plots"&gt;Human literature has patterns too&lt;/a&gt;, but the patterns of a machine are likely to be more simplistic and easier to identify and pin down.&lt;/p&gt;

&lt;p&gt;The clever programmer can work around this limitation by developing a more interesting pattern, creating multiple patterns, or expanding the corpus the machine draws from...but that requires just as much work as hand-crafting the content. This is why Orteil, the developer of Cookie Clicker, &lt;a href="https://twitter.com/Orteil42/status/802258188498333701"&gt;tweeted&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;thanks to procedural generation, I can produce twice the content in double the time&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Still, computers can tell stories that humans have never even bothered to tell before. By providing new and interesting experiences, computer-generated literature might be able to secure a worthwhile niche. This may be all it can hope for: machine-generated literature becomes its own unique genre, with its own dedicated fanbase (while most humans remain happy with human-generated works). But it is also possible that the appeal of novelty may enable machine-generated literature to reach mainstream status.&lt;/p&gt;

&lt;p&gt;Examples From 2016 NaNoGenMo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/NaNoGenMo/2016/issues/119"&gt;The Days Left Foreboding and Water&lt;/a&gt; engages in 'blackout poetry', where certain words are highlighted in text while all other words are blacked out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/NaNoGenMo/2016/issues/133"&gt;Pursuit - A Paranoid Thriller&lt;/a&gt; is a novel about an increasingly frantic chase scene, with political coverups, shady &lt;br&gt;
meetings, and insecure chatrooms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/NaNoGenMo/2016/issues/138"&gt;Dear Santa&lt;/a&gt; uses Tweets to write "[a] contemporary epistolary novel" about people's desires.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/NaNoGenMo/2016/issues/144"&gt;A Primer In Gont Poetics&lt;/a&gt; is an academic textbook about humans attempting to translate and understand alien poetry.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Further Reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/tra/who-are-the-audiences-of-computer-generated-novels"&gt;Who are the Audiences of Computer-Generated Novels?&lt;/a&gt; explains the different types of experiences that computer-generated literatures can deliver to readers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Appendix
&lt;/h1&gt;

&lt;p&gt;This blog post is computer-generated, using a handwritten corpus. &lt;a href="https://gist.github.com/tra38/d680b9cfd4eab531fde3fa4691fa9d50"&gt;Source code is here&lt;/a&gt;. I use the Ruby gem &lt;a href="https://github.com/tra38/Prolefeed"&gt;Prolefeed&lt;/a&gt; to generate this blog post.&lt;/p&gt;

&lt;h1&gt;
  
  
  Previous Articles In Text Generation
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/tra/structure-in-computer-generated-novels"&gt;Structure in Computer-Generated Novels&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/tra/using-templates-innbspcomputer-generated-works"&gt;Using Templates In Computer-Generated Works&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/tra/the-commonsense-knowledge-problem-in-computer-generated-works"&gt;The "Commonsense" Problem In Computer-Generated Works&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/tra/who-are-the-audiences-of-computer-generated-novels"&gt;Who are the Audiences of Computer-Generated Novels?&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Who are the Audiences of Computer-Generated Novels?</title>
      <dc:creator>Tariq Ali</dc:creator>
      <pubDate>Sun, 18 Sep 2016 03:08:01 +0000</pubDate>
      <link>https://dev.to/tra/who-are-the-audiences-of-computer-generated-novels</link>
      <guid>https://dev.to/tra/who-are-the-audiences-of-computer-generated-novels</guid>
      <description>&lt;p&gt;Previous Text Generation Articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/tra/structure-in-computer-generated-novels"&gt;Structure in Computer-Generated Novels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/tra/using-templates-innbspcomputer-generated-works"&gt;Using Templates In Computer-Generated Works&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/tra/the-commonsense-knowledge-problem-in-computer-generated-works"&gt;The "Commonsense" Problem In Computer-Generated Works&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;5 - Consumers of "Spaced Content"&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;"[Y]ou could probably slog through any NaNoGenMo novel, if you did it in small enough pieces and gave yourself enough time between pieces."--&lt;a href="https://github.com/dariusk/NaNoGenMo-2015/issues/11#issuecomment-160610517"&gt;Chris Pressey&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Chris Pressey wrote this comment after writing the algorithm to generate the novel &lt;a href="https://github.com/dariusk/NaNoGenMo-2015/issues/11"&gt;"A Time For Destiny"&lt;/a&gt;. He used a Story Compiler approach (discussed in &lt;a href="https://dev.to/tra/structure-in-computer-generated-novels"&gt;Structure in Computer-Generated Novels&lt;/a&gt;) to help make his novel coherent and readable, but still wanted to find another gimmick to improve its readability. He finally decided to place a warning label on his novel.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;SPACE SURGEON GENERALS WARNING: THE SPACE SURGEON GENERAL HAS DETERMINED THAT EXPOSURE TO LARGE AMOUNTS OF COMPUTER-GENERATED FICTION MAY CAUSE HEADACHES, DIZZINESS, NAUSEA, AND AN ACUTE DESIRE TO SKIP AROUND TO FIND THE GOOD BITS.&lt;/p&gt;

&lt;p&gt;FOR YOUR OWN WELL-BEING, DO NOT EXCEED THE RECOMMENDED MAXIMUM INTAKE OF 2 (TWO) CHAPTERS IN ANY 48 (FORTY EIGHT) HOUR PERIOD.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We can call this the "spaced content" approach: the content is made artificially scarce so that people are more likely to prize and enjoy it. Just because an algorithm can generate infinite content doesn't mean that you &lt;em&gt;should&lt;/em&gt;...it would be better to simply ration it out so that people don't get bored.&lt;/p&gt;

&lt;p&gt;Of course, humans may not actually listen to the warning label. They may decide to binge-read the content instead of simply waiting 48 hours. However, you could theoretically "force" the user to consume content more slowly by simply sending it out on a regular schedule. If they want to read a third chapter, they have to wait until the algorithm publishes the next chapter.&lt;/p&gt;

&lt;p&gt;Twitterbots are a great example of "spaced content" used successfully. Their appeal rests on their ability to "space out" content for a casual audience to consume without getting bored. It is much easier for a human to browse through 200 tweets than to read 50,000 words...especially if those 200 tweets are spaced out over several weeks or months. Twitterbots have become so popular and mainstream that Quartz has released listicles of the best Twitter bots in both &lt;a href="http://qz.com/279139/the-17-best-bots-on-twitter/"&gt;2014&lt;/a&gt; and &lt;a href="http://qz.com/572763/the-best-twitter-bots-of-2015/"&gt;2015&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Many Twitterbots had their origins as NaNoGenMo novels: &lt;a href="https://twitter.com/unchartedatlas"&gt;Uncharted Atlas&lt;/a&gt; used the same codebase as &lt;a href="https://github.com/dariusk/NaNoGenMo-2015/issues/156"&gt;The Deserts of the West&lt;/a&gt;, and &lt;a href="https://twitter.com/MyTruLuvSent2Me"&gt;My True Love Sent To Me&lt;/a&gt; uses the same code that &lt;a href="https://github.com/dariusk/NaNoGenMo-2015/issues/186"&gt;The One Hundred And Sixty-Five Days of Christmas&lt;/a&gt; used. &lt;a href="https://twitter.com/sixworderbot"&gt;sixworderbot&lt;/a&gt; (a Twitterbot that tweeted out 6-word stories) is a unique case where the bot came &lt;em&gt;before&lt;/em&gt; the novel: &lt;a href="https://github.com/dariusk/NaNoGenMo-2015/issues/78"&gt;Eight Thousand, Three Hundred and Thirty-Four Six-Word Stories&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The fact that even &lt;a href="http://www.newyorker.com/books/page-turner/great-american-twitter-novel"&gt;human authors&lt;/a&gt; are writing &lt;a href="http://www.twitip.com/how-to-start-a-twitter-novel/"&gt;"Twitter Novels"&lt;/a&gt; (breaking down a novel into smaller bite-sized tweets that people can more easily consume)...suggests that "spaced content" is a viable and lucrative approach for robots (even if they decide not to go down the Twitterbot avenue).&lt;/p&gt;

&lt;h3&gt;4 - Fans of Conceptual Art&lt;/h3&gt;

&lt;p&gt;What if someone translated Moby Dick into Cat Language? Well, &lt;a href="https://github.com/dariusk/NaNoGenMo-2014/issues/50"&gt;someone wrote an algorithm&lt;/a&gt;. Here's the first sentence of Moby Dick, and its translation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.&lt;/p&gt;

&lt;p&gt;Meow me Meeeeow. Meow meeow mew--meoow meow mew meow meeeeooow--meeeow meeeow me me meeow me me meoow, mew meeeoow meeeooooow me meooooow me me meeow, M meeooow M meoow meow meoow m meooow mew mew mew meooow meow me mew meeow.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since you get the joke, you don't &lt;em&gt;have&lt;/em&gt; to read the whole novel. You don't need to consume every last Meow. The novel has served its purpose in conveying a concept to you. And is that not the whole purpose of a novel? To convey an idea to another person?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Conceptual_art"&gt;Conceptual Art&lt;/a&gt;'s emphasis is on the idea or concept behind the work, and not the actual work itself. And code is a good way of formalizing abstract ideas and making them concrete. Isaac Karth, a NaNoGenMo participant, mentioned this in &lt;a href="http://procedural-generation.tumblr.com/post/137765309384/conceptual-art-has-a-deep-affinity-with-procedural"&gt;his own blog post&lt;/a&gt; about conceptual art:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One of the things that I like about NaNoGenMo in particular is that many of the projects are about taking an absolutely bizarre idea–what if you had a book that consisted of &lt;a href="https://github.com/dariusk/NaNoGenMo/issues/62"&gt;only the violent parts of The Iliad&lt;/a&gt;? What if you took all the dialog out of Pride and Prejudice? &lt;a href="http://procedural-generation.tumblr.com/post/132495826203/blood-and-cowardice-gods-thoughts-in"&gt;What if Alice went to Treasure Island&lt;/a&gt;? Or &lt;a href="https://dev.towas%20chasing%20after%20Moby%20Dick"&gt;was chasing after Moby Dick&lt;/a&gt;? &lt;a href="https://github.com/dariusk/NaNoGenMo/issues/10"&gt;What if you had a story about recursively polite questioning elves&lt;/a&gt;? What do 50,000 meows actually look like?–and make them into actual books that you can read.&lt;/p&gt;

&lt;p&gt;Jorge Luis Borges used reviews of fictional books to explore ideas without having to write out an entire novel. Procedural generation lets me explore ideas in a similar fashion–&lt;a href="https://github.com/ikarth/nonogen"&gt;what if the characters in the stories in the 1001 Nights told stories about characters who told stories about characters who told stories…recursively?&lt;/a&gt;–but in a way that lets you read the result.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, I am &lt;em&gt;not&lt;/em&gt; a fan of conceptual art. Just because you can generate a novel with a cool idea...and that you can even quote evocative passages from the novel...doesn't mean that you want to read all of the final, generated output. The translation of Moby Dick is sorely lacking and is not at all human-readable. It's possible that those who would most enjoy this "conceptual art" &lt;a href="https://twitter.com/catseye_tc/status/769975607719501825"&gt;are the programmers who built this art&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At the same time, the Moby Dick translation &lt;a href="http://www.theatlantic.com/technology/archive/2014/12/moby-dick-in-50000-meows-and-other-tales-that-computers-tell/383340/"&gt;received national press coverage&lt;/a&gt;. So maybe "generative art as conceptual art" might have a future, in pandering to people who like to see weird ideas come to life.&lt;/p&gt;

&lt;h3&gt;3 - Lovers of Personalized Content&lt;/h3&gt;

&lt;p&gt;What if you want to star in your own novel? One approach would be to write out a novel with a self-insert character, but that may be a long and tedious process just to glorify your own ego. What if you have a &lt;a href="https://en.wikipedia.org/wiki/Personalized_book"&gt;personalized book&lt;/a&gt; instead? You provide information about yourself, and then an algorithm plugs that information into a &lt;a href="https://dev.to/tra/using-templates-innbspcomputer-generated-works"&gt;computer template&lt;/a&gt;, thereby generating a novel with yourself as the main character.&lt;/p&gt;
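&lt;p&gt;At its simplest, a personalized book is just template substitution. The sketch below (with an invented template) shows the core mechanic:&lt;/p&gt;

```python
# A reader's details are substituted into a hand-written template.
# The template text here is invented for illustration.
TEMPLATE = ("{name} stepped off the train in {town}, clutching the letter. "
            "Nobody in {town} knew why {name} had come back.")

def personalize(template, **details):
    # str.format performs the substitution; a real system would also
    # handle grammar (pronouns, verb agreement) and a longer plot.
    return template.format(**details)

print(personalize(TEMPLATE, name="Tariq", town="Crestwood"))
```

&lt;p&gt;Reusing the same template with different details is what lets the industry sell a hundred different novels to a hundred different readers.&lt;/p&gt;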

&lt;p&gt;Perhaps you are the only person who might like this novel. But you still enjoyed it, right? And the template can be reused to "personalize" novels for other people as well. Instead of selling one novel to a hundred people, why not sell a hundred novels to a hundred people?&lt;/p&gt;

&lt;p&gt;To my knowledge, no personalized books have ever been written for NaNoGenMo. However, &lt;a href="https://www.google.com/search?client=safari&amp;amp;rls=en&amp;amp;q=personalized+books&amp;amp;ie=UTF-8&amp;amp;oe=UTF-8"&gt;the "personalized book" industry&lt;/a&gt; is well-established, able to generate children's books, romance novels, mystery novels, young adult novels...if someone has written a template, then you can generate a novel for yourself...as long as you provide the proper keywords. A free version of the "personalized book" concept is also available at &lt;a href="http://www.plot-generator.org.uk"&gt;the Plot Generator website&lt;/a&gt;, generating a story along with a cover and title as well.&lt;/p&gt;

&lt;p&gt;However, it is possible that as technology increases, personalization in literature can also be improved beyond mere templating. Game developers are &lt;a href="https://www.newscientist.com/article/2104873-when-infinity-gets-boring-what-went-wrong-with-no-mans-sky/"&gt;already trying to personalize the gaming experience&lt;/a&gt; on a sophisticated level:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Games that tailor experiences to players have been around for a while. For example, zombie shooter Left 4 Dead has an AI game director that tweaks the action as you play. It looks through a catalogue of scenarios – what type of enemy, how many of them, what direction they will come from – and picks what it thinks will keep your heart thumping.&lt;/p&gt;

&lt;p&gt;Combine an AI director with a catalogue of computer-generated material and we could start to see games built for individual players. Researchers are also working on systems that can automatically generate characters and stories. Ultimately, computer-generated games could drop us into tailor-made worlds that respond to how we play.&lt;/p&gt;

&lt;p&gt;Or maybe we're just getting carried away again. “It's not clear to me that people really want to have the experience they think they want,” says [Gillian Smith, a researcher in procedural generation at Northeastern University in Boston, Massachusetts]. “We're not all good improv actors.”&lt;/p&gt;

&lt;p&gt;“If you threw me into a virtual environment with characters that could respond to anything I said, I'm not convinced I'd drive that experience in an interesting direction,” she says. “Do we really want an awkward cocktail party experience in a game?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If personalization does wind up working in interactive mediums (video games), the odds are likely that this same technology can also be applied to non-interactive mediums (novels). We could imagine a far-off future scenario where an algorithm writes a book based on your own personal tastes and needs, inserting personalized scenes into a generic plot outline. Or maybe we're just getting carried away again.&lt;/p&gt;

&lt;h3&gt;2 - Binge Consumers&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;"I don't mind repetition. I'm that asshole that reads every single option every single NPC in adventure mode (DF) has, and speaks to every single NPC, reads every single book and letter, and the description of every item in every game I've ever played."---&lt;a href="http://www.bay12forums.com/smf/index.php?topic=94550.msg7072363#msg7072363"&gt;Ehndras&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Why do humans write novels? Presumably, so that other people can read them. However, humans do not think alike. Each human has their own unique and varied tastes, which can only be satisfied by certain kinds of novels. What one person may view as the "Great American Novel" and consume voraciously, another person may dismiss as utter trash and refuse to even read. The existence of different "genres" in literature is a testament to the different preferences within the human race.&lt;/p&gt;

&lt;p&gt;Is it possible, then, to find a group of humans who would have a preference (or at least, tolerance) towards the mass, unfiltered consumption of machine-generated literature? This "group of humans" can be small, as few as only a handful of people. But their existence would certainly matter in terms of determining whether "computer-generated literature" is legitimate. A niche genre is still a genre. (For more information about my thoughts about computer-generated literature turning into its own genre, please read &lt;a href="https://dev.to/tra/the-commonsense-knowledge-problem-in-computer-generated-works"&gt;The "Commonsense" Problem In Computer-Generated Works&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;This thought came to me due to comments by Ehndras, a fan of the in-development roguelike Ultima Ratio Regum (being built by Dr. Mark R. Johnson). After reading a devblog update about a newly implemented conversation system, &lt;a href="http://www.bay12forums.com/smf/index.php?topic=94550.msg7069343#msg7069343"&gt;Ehndras posted&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I can easily put 50+ hours into URR in half a week, if the content is there to be exhausted!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="http://www.bay12forums.com/smf/index.php?topic=94550.msg7070025#msg7070025"&gt;I replied&lt;/a&gt;, almost in utter disbelief, at this statement:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is shocking to me. It is very easy to produce unlimited content but getting a human to consume all that content is impossible since humans can easily get bored by that content. So writing an algorithm to produce a 50,000-word novel is easy but writing an algorithm to produce a 50,000-word novel that a human will read is difficult (hence, the NaNoGenMo competition).&lt;/p&gt;

&lt;p&gt;If you consume 50 hours of content, and in each hour, reads 1000 words (a huge underestimate), you will consume 50,000 words of content. Assuming that you can tolerate the possible repetition of content, could Mark be the first person to have produced a human-readable, computer-generated novel...simply because, you, Ehndras, was willing to consume this content?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ehndras, &lt;a href="http://www.bay12forums.com/smf/index.php?topic=94550.msg7072363#msg7072363"&gt;in his response to me&lt;/a&gt;, claimed that he could handle the task of reading 50,000 words, thanks to his unique love for literature in general and his mastery of speed reading.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I'm a writer. Character development and lore are what I love in games. Otherwise, I wouldn't have played Final Fantasy X 60+ times, beaten Kingdom Hearts a dozen times, Played every single Fallout and Elder Scrolls including the oooooooooooooooooooold versions, etc.&lt;/p&gt;

&lt;p&gt;Also, I can easily read a 500-page hardcover novel in a few short hours. I read ridiculously fast thanks to a weird technique I developed as a kid. I loooooooooove reading! I spend all day on Quora reading and writing :)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As of the time of this post, URR is still far from complete, so Ehndras has not yet begun his quest to consume content, and we do not know whether he will actually complete "50+ hours" of gameplay. However, the existence of readers such as Ehndras bodes well for the future of computer-generated literature. Instead of attempting to create "better" algorithms or working around their limitations, programmers could market their algorithms to humans who already &lt;em&gt;want&lt;/em&gt; to binge-consume machine-generated content.&lt;/p&gt;

&lt;h3&gt;1 - People Who Want To Consume An Existing Corpus&lt;/h3&gt;

&lt;p&gt;The idea that a computer could generate infinite content is appealing on the surface, but it has flaws. The flaw most often mentioned is that a user may get bored with reading infinite content once they see the patterns.&lt;/p&gt;

&lt;p&gt;One often-neglected flaw is that we already have effectively infinite content produced by &lt;em&gt;human beings&lt;/em&gt;. In fact, there is so much content that many marketers worry about a &lt;a href="http://www.businessesgrow.com/2014/01/06/content-shock/"&gt;Content Shock&lt;/a&gt;, as users start to limit the amount of content they are willing or able to consume. Reading itself is &lt;a href="https://www.washingtonpost.com/local/serious-reading-takes-a-hit-from-online-scanning-and-skimming-researchers-say/2014/04/06/088028d2-b5d2-11e3-b899-20667de76985_story.html"&gt;being transformed into skimming&lt;/a&gt;, as readers attempt to adapt to the mass of information already on the Internet.&lt;/p&gt;

&lt;p&gt;What if algorithms, instead of trying to generate new stories, simply presented an existing human-generated corpus in a new format? This could let us better sort through the effectively infinite human-generated content that already exists, allowing us to extract valuable information from it. Using a human-generated corpus is also useful because humans have already done the hard part of producing "readable text"...all the algorithm has to do is present that text in a useful fashion.&lt;/p&gt;

&lt;p&gt;Technically, search algorithms are already used to help us handle infinite content (Google is a prime example). &lt;a href="http://www.businessesgrow.com/2014/02/17/radical-filters/"&gt;More advanced content filters&lt;/a&gt; may be used in the future as well. But computer-generated novels can also serve to present and curate the information within a corpus.&lt;/p&gt;

&lt;p&gt;Consider the following computer-generated novels, each of which took an existing public-domain novel and modified it to appeal to a different audience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/dariusk/NaNoGenMo-2015/issues/120"&gt;Adventures of Conan Pelishtim&lt;/a&gt; is essentially the "The Adventures of Tom Sawyer", with all the proper nouns replaced by characters from 'Jewels of Gwahlur'&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dariusk/NaNoGenMo-2015/issues/72"&gt;God's Thoughts in Nebuchadnezzar / Through The Saying-Mouth&lt;/a&gt; are novels that combines vocabulary from the "Alice in Wonderland" series and the "King James Bible", thereby producing a novel series with apocalyptic themes&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dariusk/NaNoGenMo-2015/issues/82"&gt;Gender Swap novels&lt;/a&gt;, where the gender of the various characters in "The Adventures of Sherlock Holmes" are modified&lt;/li&gt;
&lt;/ul&gt;
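
&lt;p&gt;The core trick behind novels like these is simple word substitution over a corpus. Here is a minimal sketch of the technique; the sentence and the name mapping below are hypothetical stand-ins, not taken from the actual generators:&lt;/p&gt;

```ruby
# Hypothetical source sentence and name mapping, standing in for the real corpora.
source  = "Tom and Huck watched Becky cross the square."
mapping = { "Tom" => "Conan", "Huck" => "Muriela", "Becky" => "Gorulga" }

# Swap each mapped proper noun, matching whole words only,
# so a word like "Tomorrow" is never touched.
pattern   = /\b(#{mapping.keys.join("|")})\b/
generated = source.gsub(pattern) { |name| mapping[name] }

puts generated
# prints "Conan and Muriela watched Gorulga cross the square."
```

The humans wrote the readable prose; the algorithm only re-skins it, which is exactly why this family of generators holds up so well at novel length.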

&lt;p&gt;Readers can now interpret the corpus in a new light, examining it more carefully and gathering new insights. The effect can be even more pronounced if the reader is not familiar with the original corpus (and is thus more willing to accept the computer-generated text than the original source).&lt;/p&gt;

&lt;p&gt;There are also situations where a computer-generated novel may use elements of an existing corpus instead of modifying the whole corpus itself.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/dariusk/NaNoGenMo-2015/issues/15"&gt;Virgil's Commonplace Book&lt;/a&gt; is a travelogue, where the author travels across the world. The author then discovers or is reminded of certain public-domain passages, which are then printed out for the people to read.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dariusk/NaNoGenMo-2015/issues/25"&gt;Our Arrival&lt;/a&gt; is a "a procedurally generated diary of an expedition through fantastical places that do not exist", using sentences from Gutenebrg public domain sources to provide evocative descriptions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dariusk/NaNoGenMo-2014/issues/99"&gt;NaNoWriMo, The Novel&lt;/a&gt; uses Tweets about NaNoWriMo to generate a novel about a protagonist trying to write a novel in November&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these novels, the algorithm both tells an enjoyable story and presents a filtered view of a corpus (whether public-domain passages or online tweets).&lt;/p&gt;

&lt;h3&gt;Appendix&lt;/h3&gt;

&lt;p&gt;Just like every blog post on computer-generated literature, this blog post is computer-generated. &lt;a href="https://gist.github.com/tra38/209a6c8f4644146685d00c4ec037a4d5"&gt;Here's the source code&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I decided to be brave today by ordering the human-generated paragraphs based on a sentiment analysis algorithm. I used &lt;a href="https://github.com/7compass/sentimental"&gt;Sentimental&lt;/a&gt;, a Ruby gem that can determine the "sentiment" of textual content (specifically, whether a text is happy or sad). I then ordered the paragraphs in descending sentiment order, with the happiest paragraph first and each subsequent paragraph progressively sadder.&lt;/p&gt;
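
&lt;p&gt;For illustration, here is a toy version of that ordering step. A tiny made-up word lexicon stands in for the Sentimental gem's scoring, and the section texts are invented stand-ins:&lt;/p&gt;

```ruby
# Tiny made-up lexicon standing in for the Sentimental gem's word scores.
LEXICON = { "love" => 2.0, "good" => 1.0, "boring" => -1.0, "gibberish" => -2.0 }

# Score a text by summing the scores of its known words.
def sentiment(text)
  text.downcase.scan(/[a-z]+/).sum { |word| LEXICON.fetch(word, 0.0) }
end

# Illustrative section titles with stand-in texts.
sections = {
  "Binge Consumers"  => "I love reading, and reading is good.",
  "Corpus Consumers" => "Unfiltered output turns into boring gibberish."
}

# Sort from happiest to saddest, as in the "From Bad to Worse" curve.
ordered = sections.sort_by { |_title, text| -sentiment(text) }.map(&:first)
puts ordered.inspect
```

The real gem scores words against a much larger lexicon, but the shape of the generator (score each section, sort descending) is the same.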

&lt;p&gt;This approach was inspired by &lt;a href="http://visual.ly/kurt-vonnegut-shapes-stories-0?utm_source=visually_embed"&gt;Kurt Vonnegut Jr.'s plot curves&lt;/a&gt;. Vonnegut believed that the plot of a story can be charted by how happy or sad each section is, and I wanted to produce a blog post using the "From Bad to Worse" curve, where the story starts off happy and turns sad. (I have mixed-to-negative feelings about computer-generated literature, and I wanted that to show within this text.)&lt;/p&gt;

&lt;p&gt;Here are the sentiment values of each of the 'sections' in this essay, according to the Sentimental gem (positive numbers indicate happiness):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spaced Content: 11.394800000000005&lt;/li&gt;
&lt;li&gt;Conceptual Art: 9.1441&lt;/li&gt;
&lt;li&gt;Personalized Content: 5.679799999999999&lt;/li&gt;
&lt;li&gt;Binge Consumers: 0.093799999999999&lt;/li&gt;
&lt;li&gt;Corpus Consumers: -1.1868999999999996&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But sentiment analysis is &lt;em&gt;not&lt;/em&gt; objective. Humans can have different opinions on the mood of a text, and algorithms are no different. To compare, I decided to use &lt;a href="https://alchemy-language-demo.mybluemix.net"&gt;AlchemyAPI's own sentiment analysis algorithm&lt;/a&gt; to determine the sentiment of these paragraphs. And, to my shock, AlchemyAPI suggested a different ordering.&lt;/p&gt;

&lt;p&gt;Here's the ordering of the paragraphs suggested by AlchemyAPI (along with their sentiment values...positive numbers indicating happiness):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Corpus Consumers: 0.217397&lt;/li&gt;
&lt;li&gt;Binge Consumers: 0.163047&lt;/li&gt;
&lt;li&gt;Conceptual Art: 0.0994031&lt;/li&gt;
&lt;li&gt;Personalized Content: -0.00243115&lt;/li&gt;
&lt;li&gt;Spaced Content: -0.0473679&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This shows that your tools can have a major effect on the final generated output. So...choose your tools wisely.&lt;/p&gt;

&lt;p&gt;Instead of bookending my blog post with introductory and concluding paragraphs, I also decided to explicitly adopt &lt;a href="https://en.wikipedia.org/wiki/Listicle"&gt;a listicle format&lt;/a&gt;, where each section is numbered. The listicle is a quick and easy way of structuring content, and is thus very amenable to use by computers.&lt;/p&gt;

&lt;p&gt;This will also be the last major article on "text generation". These articles have scratched the surface of this interesting field, and scratching much deeper would yield diminishing returns. If you are still curious, I encourage you to conduct your own research into this field...and even to participate in this year's NaNoGenMo (held every November on GitHub). I may post again if some new or revolutionary discovery occurs, but until then, I'm taking a well-deserved break.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The "Commonsense" Problem In Computer-Generated Works</title>
      <dc:creator>Tariq Ali</dc:creator>
      <pubDate>Fri, 26 Aug 2016 13:31:04 +0000</pubDate>
      <link>https://dev.to/tra/the-commonsense-knowledge-problem-in-computer-generated-works</link>
      <guid>https://dev.to/tra/the-commonsense-knowledge-problem-in-computer-generated-works</guid>
      <description>&lt;p&gt;Previous Text Generation Articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/tra/structure-in-computer-generated-novels"&gt;Structure in Computer-Generated Novels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/tra/using-templates-innbspcomputer-generated-works"&gt;Using Templates In Computer-Generated Works&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;Introduction&lt;/h1&gt;

&lt;p&gt;"It's ... worth noting, I think, that many successful machine-made works, like the &lt;a href="https://www.amazon.com/Experiments-Musical-Intelligence-Computer-Digital/dp/0895793377"&gt;music of EMI&lt;/a&gt; or Cope's subsequent machine creativity project, &lt;a href="https://en.wikipedia.org/wiki/Emily_Howell"&gt;Emily Howell&lt;/a&gt;, or the screenplay for &lt;a href="https://en.wikipedia.org/wiki/Sunspring"&gt;Sunspring&lt;/a&gt;, rely heavily on interpretation by humans, making the machine prominence of the source a novelty which &lt;strong&gt;excuses the search for meaning instead of encouraging it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By which I mean, it's difficult to say what Benjamin, the AI that wrote Sunspring, could possibly be getting at, because Benjamin is an artificial intelligence. The common sense read is that Benjamin is getting nothing."---Mike Rugnetta, &lt;a href="https://www.youtube.com/watch?v=Sbd4NX95Ysc"&gt;Can an Artificial Intelligence Create Art?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Machine capabilities are constantly increasing in a variety of different fields. First limited to industrial automation, machines are now being used for 'creative' enterprises as well: &lt;a href="https://www.jukedeck.com"&gt;music generation&lt;/a&gt;, &lt;a href="http://prisma-ai.com"&gt;painting generation&lt;/a&gt;, &lt;a href="https://www.narrativescience.com"&gt;text generation&lt;/a&gt;, etc. However, while a machine is able to generate creative works, it does not understand what it is generating. From its perspective, it's simply manipulating symbols based on its external input and its internal programming. Machines lack the &lt;a href="https://en.wikipedia.org/wiki/Commonsense_knowledge_(artificial_intelligence)"&gt;commonsense knowledge&lt;/a&gt; that we take for granted. Researchers are attempting...and failing...to implement commonsense knowledge in AI.&lt;/p&gt;

&lt;p&gt;And the lack of commonsense knowledge could serve as a barrier to full acceptance of computer-generated works. In fact, I once wrote a blog post entitled &lt;a href="http://tra38.github.io/blog/c20-robojournalism-2.html"&gt;Why Robots Will Not (Fully) Replace Human Writers&lt;/a&gt; arguing why algorithms will "only" write the majority of all literature, instead of displacing human authors entirely:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Humans take for granted their ability to perceive the world. Their five senses gives a continual stream of data that humans are able to quickly process. Bots, on other hand, are only limited to the "raw data" that we give them to process. They will not "see" anything that is not in the dataset. As a result, how the bots understand our world will be very foreign to our own (human) understanding.&lt;/p&gt;

&lt;p&gt;... Humans will likely tolerate the rise of automation in literature, and accept it. Bots may even write the majority of all literature by 2100. But there will still be some marginal demand for human writers, simply because humans can relate more to the "worldview" of other humans. These human writers must learn how to coexist with their robotic brethren though.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, implementing commonsense knowledge is not &lt;em&gt;necessary&lt;/em&gt; for successful text generation. Unlike Mike Rugnetta, I believe there are three approaches that can allow us to successfully search for meaning within computer-generated texts, even when the computers fail to understand or appreciate that meaning. Each approach carries its own drawbacks, however, and the programmer must decide which trade-offs to make.&lt;/p&gt;

&lt;h1&gt;World Models&lt;/h1&gt;

&lt;p&gt;Computers and humans speak different languages. While we speak in natural languages, computers only understand programming languages. But what if we translate our ideas into a series of hard-coded rules based on real life? Then the computer can read those rules and use them as a basis for generating coherent and meaningful text. The meaning, after all, comes from the hard-coded rules that the computer is simply executing.&lt;/p&gt;

&lt;p&gt;The computer-science term for "a bunch of hardcoded rules" is the "world model", and when programmers started researching text generation, they immediately started using world models. In the 1960s, they built &lt;a href="https://github.com/lizadaly/nanogenmo2015"&gt;SAGA II&lt;/a&gt;, a computer program that generates scripts about a gunfight between a cop and a robber, driven by rules describing how the cop and the robber behave when facing each other. From the 1970s onwards, they built &lt;a href="http://wikis.sub.uni-hamburg.de/lhn/index.php/Story_Generator_Algorithms"&gt;many story generation algorithms&lt;/a&gt; using world models. Some algorithms tried to simulate the behavior of different characters, like SAGA II, while others attempted to simulate the behavior of the &lt;em&gt;author's mind&lt;/em&gt; in developing the story and deciding what the characters do in it. A few algorithms even implemented a form of 'self-evaluation' of the generated content, allowing the machine to 'revise' the story if it did not meet certain criteria.&lt;/p&gt;
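
&lt;p&gt;To make the idea concrete, here is a deliberately tiny world model in the spirit of SAGA II's cop-and-robber scripts. The single rule and the prose are my own toy invention, not SAGA II's actual rules:&lt;/p&gt;

```ruby
# One hard-coded rule about the world: each turn, the cop closes in by one pace;
# at zero distance, the chase ends in an arrest. Entirely illustrative.
state  = { cop_distance: 3 }
script = []

until state[:cop_distance] == 0
  script << "The robber runs. The cop closes to #{state[:cop_distance]} paces."
  state[:cop_distance] -= 1
end
script << "The cop arrests the robber."

puts script.join("\n")
```

The output is coherent because the rule guarantees it, and dull for exactly the same reason, which is the trade-off discussed below.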

&lt;p&gt;World models have also been used outside of academia. Liza Daly wrote about their current use in &lt;a href="https://worldwritable.com/https-worldwritable-com-natural-language-processing-for-programmers-world-models-b01943959830#.hdzwmp430"&gt;both video games and text generation&lt;/a&gt;, since they are good at providing coherence. Liza mentioned NaNoGenMo algorithms such as &lt;a href="https://gist.github.com/dariusk/3e1049e873aa5185d6e0"&gt;Teens Wander Around The House&lt;/a&gt; and &lt;a href="https://github.com/dariusk/NaNoGenMo-2014/issues/146"&gt;The Seeker&lt;/a&gt; as examples of world models representing characters, while I may also mention &lt;a href="https://github.com/dariusk/NaNoGenMo-2015/issues/11"&gt;A Time For Destiny&lt;/a&gt; as a world model representing the "author's mind".&lt;/p&gt;

&lt;p&gt;According to Liza Daly, world models temporarily fell out of favor during the 1980s because of scaling issues. It simply took too much time for a human to write out the rules, and too long for a computer of that era to process them. Our computers are faster now, and we may be better at abstracting away the rule-making details, so maybe this time is different.&lt;/p&gt;

&lt;p&gt;However, the main drawback of world models is that they limit the creative potential of the machinery. While hardcoded rules ensure that the generated works make some logical sense (provided the rules aren't buggy), they also exclude much of the possibility of interesting creativity. The generated works are sensible, but dull to read. Even the output of a Markov chain can shock you...sometimes. The output of a world model, though, may be too conventional and predictable. You reduce the risk of generating utter nonsense, but you also reduce the odds of generating something interesting.&lt;/p&gt;

&lt;h1&gt;New Genre&lt;/h1&gt;

&lt;p&gt;It would be crazy to say that romance novels compete against historical fiction novels, or that people will give up reading science fiction once we learn how to mass-produce murder mysteries. Genres exist within literature because human beings have different tastes and desires (though works can easily belong to multiple genres...a historical-fiction romance novel, for instance). It makes sense, therefore, that computer-generated literature could exist as its own separate genre, adhering to its own unique conventions and appealing to a certain niche audience.&lt;/p&gt;

&lt;p&gt;The audience for computer-generated works may come from the fanbase of &lt;a href="https://en.wikipedia.org/wiki/Procedural_generation"&gt;procedural generation&lt;/a&gt;, a computational approach used in video games to produce content. One video game developer, Bruno Dias, talked about procedural generation in &lt;a href="https://ifsff.wordpress.com/2016/06/17/interview-bruno-dias-on-voyageur-and-procedural-generation/"&gt;an interview&lt;/a&gt; about his game-in-development, Voyageur:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In games, procgen originated as a workaround for technical limitations, allowing games like Elite to have huge galaxies that never actually had to exist in the limited memory of a 90's computer. But it quickly became an engine of surprise and replayability; roguelikes wouldn't be what they are if the dungeon wasn't different each time, full of uncertainty. Voyageur represents an entry into what we could call the “third generation of procgen in games: procedural generation as an aesthetic."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Voyageur is a "space exploration" game that uses procedural generation to create the textual descriptions of the planets a player can travel to. Bruno Dias stated that the goal of the procedural generation in his game is to "explor[e] the tenor and meaning of procedural prose". If people like Bruno's procedurally-generated descriptions, they may be receptive to future works that embrace this 'aesthetic'.&lt;/p&gt;

&lt;p&gt;NaNoGenMo also seems to represent the ethos of "procedural generation as an aesthetic", with its various programmers interested in using algorithms to express ideas in new and interesting ways. &lt;a href="http://arcade.stanford.edu/blogs/nanogenmo-dada-20"&gt;One of the many news articles about NaNoGenMo&lt;/a&gt; compared the yearly competitions to &lt;a href="https://en.wikipedia.org/wiki/Dada"&gt;Dadaism&lt;/a&gt;, and noted that one competitor also saw influences of "&lt;a href="https://en.wikipedia.org/wiki/Cut-up_technique"&gt;Burroughs' cut-up techniques&lt;/a&gt;" and the "constraint-oriented works of &lt;a href="https://en.wikipedia.org/wiki/Oulipo"&gt;Oulipo&lt;/a&gt;".&lt;/p&gt;

&lt;p&gt;The author, Kathryn Hume, went even further in her defense of "procedural generation as an aesthetic" by pointing out that most humans believe the purpose of text generation is to "[write] prose that we would have written ourselves". Hume believes that text generation would be better off focusing on other goals instead:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[W]hat if machines generated text with different stylistic goals? Or rather, what if we evaluated machine intelligence not by its humanness but by its alienness, by its ability to generate something beyond what we could have created—or would have thought to create—without the assistance of an algorithm? What if automated prose could rupture our automatized perceptions, as Shklovsky described poetry in Art as Device, and offer a new vehicle for our own creativity?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, the main drawback of this approach is that it is an admission of defeat...or at least, an admission that your work is only intended for consumption by a niche. A few people might love to "rupture ... automatized perceptions" and would embrace machine-generated texts, warts and all. However, I doubt that the vast majority of humans would embrace generated texts so readily. After all, Dadaism, cut-ups, and Oulipo did not take over the literary world. Even Kathryn Hume agrees that big businesses prefer to invest in "human-like" text generation: "[I]nvestment banks and news agencies like Forbes won't pay top dollar for software that generates strange prose".&lt;/p&gt;

&lt;h1&gt;Human Curation&lt;/h1&gt;

&lt;p&gt;Machines are not perfect. Code can be fallible. Programs can generate output that is dull, boring, and uninteresting.&lt;/p&gt;

&lt;p&gt;However, humans are also not perfect. They can be fallible. They generate output that is dull, boring, and uninteresting. The difference is that humans are (usually) able to &lt;em&gt;judge&lt;/em&gt; the quality of their work, and determine whether their output is good or bad. The 'good' output of humans are the ones that get published. The 'bad' output are ignored and forgotten. Sometimes, humans can choose to 'edit' the work of other humans, turning 'bad' output into 'good' output.&lt;/p&gt;

&lt;p&gt;In the same way, human curators can review the output of the machine. A program can generate hundreds of stories, and curators can review them to find the most promising ones. The curators can then select the "good" stories and show them to the general public, while editing or throwing away the "bad" stories.&lt;/p&gt;

&lt;p&gt;The very act of curation also adds meaning to the computer-generated work, as the human curators use their own ideas and beliefs to decide what stories to select and what to reject. When you read the computer-generated work based on human curation, you are reading a work with both machine and human influences.&lt;/p&gt;

&lt;p&gt;This approach is far more common than is generally acknowledged. Every time someone gleefully posts the evocative output of a Markov chain, there are &lt;em&gt;other&lt;/em&gt; outputs that are filtered away, never to be seen by another human being. Even human journalists, whenever they cover the NaNoGenMo competitions, do not feel the need to copy and paste whole computer-generated novels into &lt;a href="http://www.nerdcore.de/2015/12/23/algo-poetry-generation-month/#.VoVV_oQc-AY"&gt;their news articles&lt;/a&gt;. Instead, they only choose key quotes from the computer-generated novels...quotes that they feel are interesting enough for their audience to read.&lt;/p&gt;

&lt;p&gt;You can see this type of approach on full display at &lt;a href="http://curatedai.com"&gt;CuratedAI&lt;/a&gt;, a website that describes itself as "[a] literary magazine written by machines, for people". The human programmers choose which works to send to this literary magazine, and it is perfectly fine for them to 'lightly edit' the generated output. The machine-generated works on CuratedAI can be interesting to read, but that's because the human editors are there to ensure they &lt;em&gt;stay&lt;/em&gt; interesting.&lt;/p&gt;

&lt;p&gt;The main drawback with the Human Curation approach is the manual labor involved in the process. While it is easier to "edit" a computer-generated work than it is to create the work in the first place, the human must still play a rather overt role in this "creative" process. There's also a philosophical question: if machines generate literature, and then humans heavily edit the literature before publishing it, then was the final output 'really' computer-generated?&lt;/p&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;We are currently unable to program commonsense knowledge into our algorithms...however, this does not serve as a complete roadblock to text generation. Programmers are free to use the approaches outlined above to ensure that machines can generate text filled with meaning and creativity.&lt;/p&gt;

&lt;p&gt;Each approach has its drawbacks and flaws. There is nothing scary about drawbacks and flaws, so long as you are aware of them. Trade-offs must be made all the time in software development; text generation is no different.&lt;/p&gt;

&lt;h1&gt;Appendix&lt;/h1&gt;

&lt;p&gt;Just like every other blog post on text generation, this blog post is generated by a computer. It is fairly simple text generation...the introductory and concluding sections are fixed, while the three "approaches" are randomly shuffled. &lt;a href="https://gist.github.com/tra38/71ab85ababec48055061ac510efe2697"&gt;Here's the source code.&lt;/a&gt; The algorithm is very lazy, but I've been spending a lot of time writing the content for this blog post, and I'd rather push something out the door. Sorry. I'll see if I can think of some more creative text generation algorithms in the future.&lt;/p&gt;
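
&lt;p&gt;The overall structure can be sketched in a few lines of Ruby (a simplification of the linked source code, not a copy of it):&lt;/p&gt;

```ruby
# Fixed bookends with a shuffled middle; the section names come from this post.
intro      = "Introduction"
approaches = ["World Models", "New Genre", "Human Curation"]
conclusion = "Conclusion"

# Build the post: intro first, the three approaches in random order, conclusion last.
post = [intro, *approaches.shuffle, conclusion]
puts post.join("\n")
```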

&lt;p&gt;This blog post was inspired by &lt;a href="https://disqus.com/home/discussion/arcadestanfordedu/nanogenmo_dada_20/"&gt;a comment discussion thread&lt;/a&gt; on the blog post &lt;a href="http://arcade.stanford.edu/blogs/nanogenmo-dada-20"&gt;NaNoGenMo: Dada 2.0&lt;/a&gt;. James Kennedy and I (as "Realist Writer") discussed whether text generation could reach human standards, and how to resolve the "friability of semantic trust" that could exist whenever robots produce terrible stories.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Defining The Industry</title>
      <dc:creator>Tariq Ali</dc:creator>
      <pubDate>Mon, 15 Aug 2016 13:12:25 +0000</pubDate>
      <link>https://dev.to/tra/defining-the-industry</link>
      <guid>https://dev.to/tra/defining-the-industry</guid>
      <description>&lt;p&gt;&lt;em&gt;After reading Ben Halpren's blog post: &lt;a href="https://dev.to/ben/what-industry-am-i-in"&gt;"What industry am I in?"&lt;/a&gt;, I thought about giving him a link to a lecture a friend of mine showed that explained his thoughts on the "complexity management industry", and perhaps used this lecture as a starting point for actually answering Ben Halpren's question. But I decided that excerpting from the lecture would be more worthwhile and useful. Defining the industry is more important than actually naming it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This excerpt is taken from &lt;a href="https://www.youtube.com/watch?v=2Op3QLzMgSY"&gt;Lecture 1A of the MIT 6.001 Structure and Interpretation of Computer Programming, 1986&lt;/a&gt;, by Hal Abelson.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;PROFESSOR: I'd like to welcome you to this course on computer science. Actually, that's a terrible way to start. Computer science is a terrible name for this business.&lt;/p&gt;

&lt;p&gt;First of all, it's not a science. It might be engineering or it might be art, but we'll actually see that computer so-called science actually has a lot in common with magic, and we'll see that in this course.&lt;/p&gt;

&lt;p&gt;So it's not a science.&lt;/p&gt;

&lt;p&gt;It's also not really very much about computers. And it's not about computers in the same sense that physics is not really about particle accelerators, and biology is not really about microscopes and petri dishes. And it's not about computers in the same sense that geometry is not really about using surveying instruments.&lt;/p&gt;

&lt;p&gt;In fact, there's a lot of commonality between computer science and geometry.&lt;/p&gt;

&lt;p&gt;Geometry, first of all, is another subject with a lousy name. The name comes from Gaia, meaning the Earth, and metron, meaning to measure. Geometry originally meant measuring the Earth or surveying. And the reason for that was that, thousands of years ago, the Egyptian priesthood developed the rudiments of geometry in order to figure out how to restore the boundaries of fields that were destroyed in the annual flooding of the Nile.&lt;/p&gt;

&lt;p&gt;And to the Egyptians who did that, geometry really was the use of surveying instruments.&lt;/p&gt;

&lt;p&gt;Now, the reason that we think computer science is about computers is pretty much the same reason that the Egyptians thought geometry was about surveying instruments. And that is, when some field is just getting started and you don't really understand it very well, it's very easy to confuse the essence of what you're doing with the tools that you use.&lt;/p&gt;

&lt;p&gt;And indeed, on some absolute scale of things, we probably know less about the essence of computer science than the ancient Egyptians really knew about geometry.&lt;/p&gt;

&lt;p&gt;Well, what do I mean by the essence of computer science? What do I mean by the essence of geometry?&lt;/p&gt;

&lt;p&gt;See, it's certainly true that these Egyptians went off and used surveying instruments, but when we look back on them after a couple of thousand years, we say, gee, what they were doing, the important stuff they were doing, was to begin to formalize notions about space and time, to start a way of talking about mathematical truths formally.&lt;/p&gt;

&lt;p&gt;That led to the axiomatic method. That led to sort of all of modern mathematics, figuring out a way to talk precisely about so-called declarative knowledge, what is true.&lt;/p&gt;

&lt;p&gt;Well, similarly, I think in the future people will look back and say, yes, those primitives in the 20th century were fiddling around with these gadgets called computers, but really what they were doing is starting to learn how to formalize intuitions about process, how to do things, starting to develop a way to talk precisely about how-to knowledge, as opposed to geometry that talks about what is true.&lt;/p&gt;

&lt;p&gt;... Here is a piece of mathematics that says what a square root is.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;DECLARATIVE KNOWLEDGE:&lt;/p&gt;

&lt;p&gt;The square root of X is the number Y, such that Y squared is equal to X and Y is greater than or equal to 0.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, that's a fine piece of mathematics, but just telling you what a square root is doesn't really say anything about how you might go out and find one.&lt;/p&gt;

&lt;p&gt;So let's contrast that with a piece of imperative knowledge, how you might go out and find a square root. This, in fact, also comes from Egypt, not ancient, ancient Egypt. This is an algorithm due to Heron of Alexandria, called how to find a square root by successive averaging.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;IMPERATIVE KNOWLEDGE:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make a guess (G)&lt;/li&gt;
&lt;li&gt;Improve the guess by averaging G and X/G&lt;/li&gt;
&lt;li&gt;Keep improving the guess until it is good enough&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's a method. That's how to do something as opposed to declarative knowledge that says what you're looking for. That's a process.&lt;/p&gt;
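&lt;p&gt;The lecture goes on to cast such procedures in Lisp; purely for illustration, here is a minimal Python sketch of the three successive-averaging steps above (the function name, starting guess, and tolerance are my own choices, not from the lecture):&lt;/p&gt;

```python
def heron_sqrt(x, tolerance=1e-10):
    """Approximate the square root of x by Heron's successive averaging."""
    guess = 1.0                          # 1. Make a guess (G)
    while abs(guess * guess - x) > tolerance:
        guess = (guess + x / guess) / 2  # 2. Improve G by averaging G and X/G
    return guess                         # 3. Stop once the guess is good enough

print(heron_sqrt(2))  # close to the square root of 2, about 1.41421
```

&lt;p&gt;Notice how the code is pure "how-to" knowledge: it never states what a square root is, only a process that finds one.&lt;/p&gt;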

&lt;p&gt;Well, what's a process in general? It's kind of hard to say. You can think of it as like a magical spirit that sort of lives in the computer and does something. And the thing that directs a process is a pattern of rules called a procedure.&lt;/p&gt;

&lt;p&gt;So procedures are the spells, if you like, that control these magical spirits that are the processes. I guess you know everyone needs a magical language, and sorcerers, real sorcerers, use ancient Akkadian or Sumerian or Babylonian or whatever.&lt;/p&gt;

&lt;p&gt;We're going to conjure our spirits in a magical language called Lisp, which is a language designed for talking about, for casting the spells that are procedures to direct the processes.&lt;/p&gt;

&lt;p&gt;Now, it's very easy to learn Lisp. In fact, in a few minutes, I'm going to teach you, essentially, all of Lisp. I'm going to teach you, essentially, all of the rules.&lt;/p&gt;

&lt;p&gt;And you shouldn't find that particularly surprising. That's sort of like saying it's very easy to learn the rules of chess. And indeed, in a few minutes, you can tell somebody the rules of chess.&lt;/p&gt;

&lt;p&gt;But of course, that's very different from saying you understand the implications of those rules and how to use those rules to become a masterful chess player.&lt;/p&gt;

&lt;p&gt;Well, Lisp is the same way. We're going to state the rules in a few minutes, and it'll be very easy to see.&lt;/p&gt;

&lt;p&gt;But what's really hard is going to be the implications of those rules, how you exploit those rules to be a master programmer. And the implications of those rules are going to take us the, well, the whole rest of the subject and, of course, way beyond.&lt;/p&gt;

&lt;p&gt;OK, so in computer science, we're in the business of formalizing this sort of how-to imperative knowledge, how to do stuff. And the real issues of computer science are, of course, not telling people how to do square roots. Because if that was all it was, it wouldn't be a big deal.&lt;/p&gt;

&lt;p&gt;The real problems come when we try to build very, very large systems, computer programs that are thousands of pages long, so long that nobody can really hold them in their heads all at once.&lt;/p&gt;

&lt;p&gt;And the only reason that that's possible is because there are techniques for controlling the complexity of these large systems. And these techniques for controlling complexity are what this course is really about.&lt;/p&gt;

&lt;p&gt;And in some sense, that's really what computer science is about. Now, that may seem like a very strange thing to say. Because after all, a lot of people besides computer scientists deal with controlling complexity.&lt;/p&gt;

&lt;p&gt;A large airliner is an extremely complex system, and the aeronautical engineers who design that are dealing with immense complexity.&lt;/p&gt;

&lt;p&gt;But there's a difference between that kind of complexity and what we deal with in computer science. And that is that computer science, in some sense, isn't real.&lt;/p&gt;

&lt;p&gt;You see, when an engineer is designing a physical system, that's made out of real parts. The engineers who worry about that have to address problems of tolerance and approximation and noise in the system. So for example, as an electrical engineer, I can go off and easily build a one-stage amplifier or a two-stage amplifier, and I can imagine cascading a lot of them to build a million-stage amplifier. But it's ridiculous to build such a thing, because long before the millionth stage, the thermal noise in those components way at the beginning is going to get amplified and make the whole thing meaningless.&lt;/p&gt;

&lt;p&gt;Computer science deals with idealized components. We know as much as we want about these little program and data pieces that we're fitting together. We don't have to worry about tolerance. And that means that, in building a large program, there's not all that much difference between what I can build and what I can imagine, because the parts are these abstract entities that I can know about as precisely as I'd like.&lt;/p&gt;

&lt;p&gt;So as opposed to other kinds of engineering, where the constraints on what you can build are the constraints of physical systems, the constraints of physics and noise and approximation, the constraints imposed in building large software systems are the limitations of our own minds.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
