loading...
Cover image for Using Machine Learning To Generate Human-Readable News Articles

Using Machine Learning To Generate Human-Readable News Articles

tra profile image Tariq Ali ・12 min read

TL;DR Abstract - I built ZombieWriter, a Ruby gem that will enable users to generate news articles by aggregating paragraphs from other sources. It can use either machine learning algorithms (Latent Semantic Analysis and k-means clustering) or randomization to generate human-readable articles. In this article, I demonstrate how ZombieWrtier can use machine learning to create a Markdown file containing 17 human-readable articles.

After finishing the demonstration and comparing the output to a randomization process, I then explain possible "future research" plans to improve the text generation process. I am not yet ready to claim that this technology is disruptive. But it's very close to being "production-ready".


Introduction

Machine learning is hot (to put it mildly). The paradigm of using data instead of code to program machines has been applied to solve a variety of real-world problems. It also been applied to the creative arts, with neural networks being used to generate paintings and music.

Text generation using machine learning is far less advanced though. Many famous attempts have been focused on training neural networks to generate text, and the output tends to be evocative in short bursts. However, these texts do not "scale" properly. Any coherence that a neural network generates by accident gets drowned out by the incoherence that it usually generates.

I have previously blogged about text generation. But I've been interested in "human-readable" text generation - text-generation that reaches the same quality standards of human literature. Machine learning algorithms were impressive, but they could not scale up well. So I usually ignored them in my blogging, in favor of "structural modeling" (which I defined as "writing code that 'encodes' the plot/style/characterization at a large scale").

Yet, technology marches onward. As machine learning becomes more popular and accessible, people are able to try different approaches and techniques to the problem of text generation. Some of these techniques actually were successful, when applied in conjunction with structural modeling. For example, Mike Lynch used neural networks to generate words for fantasy settings.

This march of technology should not be that surprising. An infinite number of monkeys with an infinite number of typewriters can write Shakespeare...similarly, an infinite number of programmers conducting an infinite number of experiments can make great discoveries.

A few months ago, I admitted that machine learning may play a role in the future of text generation, and made a prediction - "Statistical Approaches Will Be Part Of A 'Toolkit' In Text Generation". At the same time, I was putting the finishing touches on a secret project to use machine learning in my own text generation. (If you're curious in the nitty-gritty research details, here's some excerpts of a Slack conversation in September 2016 about my experiments.)

The final result of my project is ZombieWriter. According to its GitHub profile, ZombieWriter is "a Ruby gem that will enable users to generate news articles by aggregating paragraphs from other sources". You have to provide the paragraphs, but once it does so, ZombieWriter will arrange the paragraphs in a manner that appears as if they're news articles.

While you can download the gem and try it out yourself, knowing how it works may be more interesting. Here's the step-by-step guide for how ZombieWriter's Machine Learning algorithm works.

Step One - Latent Semantic Analysis

Latent Semantic Analysis is a machine learning algorithm that is used for determining how similar paragraphs are to each other. It was invented in the late 1980s, and had generally been used for "information retrieval".

LSA works by creating "bags-of-words" containing all the important words in a paragraph, and then comparing the different "bags-of-words" to determine their 'similarity'. Rather than deal with the math (because I don't quite understand it myself), here's a visual example, with four documents.

document bag-of-words
A cat, meow
B cat, feline, pet
C pet, food
D feline, lion

Document A and Document B are very similar to each other, because they both share the same word ("cat"). Document B and Document C are similar to each other because they share the same word ("pet"). Document B and Document D are similar to each other because they share the same word ("feline"). Etc., etc.

One interesting feature about LSA is that it is able to determine relationships between different words based on their presence in documents...which also helps it identify similarities.

For example, Document B contains the words "cat" and "feline", which suggest that both words are related to each other. Since Document D contains the term "feline", Document D is obviously similar to Document B (which contains "feline"). Document D is also similar to Document A, since Document A contains the word "cat", and "cat" is related to "feline".

For more information on the mathematics behind LSA, please review the blog post Latent Semantic Analysis in Ruby. I used Classifier-Reborn, a Ruby gem, for conducting LSA of my paragraphs.

Step Two - K-Means Clustering

Once you determine what paragraphs are similar to each other, you need to group similar paragraphs together, using an approach called "clustering". Clustering tends to be used for exploratory data analysis; you have a lot of data that you want to easily understand, so you have an algorithm separate the data into different groups.

There are many different ways to clustering data, but the most popular approach to clustering is "k-means clustering" (k stands for the number of clusters you want). k cluster centers are created, and the algorithm move the cluster centers around to minimize the distance between the cluster centers to each element within the cluster. Anil K. Jain wrote an excellent paper outlining the various different approaches to clustering, and provided excellent visualizations, including this example of k-means clustering with an arbitrary dataset:

Visual explanation of K-Means Clustering

I chose k-means clustering because it was a tried-and-tested approach to clustering, meaning that all its strengths and weaknesses are pretty well-documented. To implement k-means clustering, I used the KMeansClusterer ruby gem.

Step Three - Headlines

Finally, we need to generate a headline for each cluster of paragraphs, to make the clusters look less like a collection of similar paragraphs and more like news articles. There was three approaches I could take:

  • Find key words within the paragraphs and use them outright as the headline. (Highscore Ruby gem.)
  • Look for key phrases within the article and just regurgitate those key phrases. (N-Gram Ruby gem.)
  • "Summarize" the article by picking the "most important sentence" in the article. Use that summary as your headline (epitome Ruby gem.)

The last approach produced the most readable headline with the least amount of human intervention, so I chose that approach. However, I then realized that Classifier-Reborn also had its own built-in Summarizer. To avoid taking on unneeded dependencies, I decided to use Classifer-Reborn's summarizer instead of "epitome".

Classifier-Reborn uses LSA for its summarization. It first breaks up the "cluster" into sentences and then choose the sentence has the highest similarity when compare d to all the other sentences. So, let's look at my table again, this time treating each "document" as a sentence in a larger article.

document bag-of-words
A cat, meow
B cat, feline, pet
C pet, food
D feline, lion

Document B appear to have the most similarity, as it shares the same words as all the other documents in the set (A, C, and D). Therefore, it is likely that we would pick Document B and treat it as our headline.

A Demonstration of ZombieWriter

National Novel Generation Month (or NaNoGenMo) is a very interesting competition, with a lot of commentary over it. Rather than write our own commentary, why not simply reuse and remix existing commentary?

For this demonstration, I first prepared a CSV file nanogenmo.csv. This CSV file contains 90 paragraphs about the NaNoGenMo (National Novel Generation Month) competition, found all over the Internet. To ensure that we don't engage in plagiarism, we will always provide proper attribution to each source.

Then, I write a small script to parse this CSV, include it into ZombieWriter and then generate some news articles in an external file. ZombieWriter will also include proper citations, linking back to the original sources.

require 'zombie_writer'

#I like to use the smarter_csv gem as my CSV parser. It's a personal taste of mine.

require 'smarter_csv'

zombie = ZombieWriter::MachineLearning.new

array_of_paragraphs = SmarterCSV.process("nanogenmo.csv")

array_of_paragraphs.each do |paragraph|
  zombie.add_string(paragraph)
end

array = zombie.generate_articles

File.open("articles.md", "w+") do |f|
  array.each { |article| f.puts("#{article}\n- - -\n\n") }
end

Here is the Markdown file containing 17 different articles about NaNoGenMo. And here's an example generated article:

0 - Computers are coming for our jobs, as writers

Creative and artistic feats are often seen as the last refuge for human endeavor from the coming robot apocalypse. But if NaNoGenMo gains a foothold and improves, at least we'll all be well entertained in our unemployment.---"Computers Write Novels Faster Than You Do", Smithsonian Maganize

Computers are coming for our jobs, as writers. Spooky.---Tom Trimbath

But maybe this contest just reflects our evolution towards a more technological society. For the last 16 years November has seen “National Novel Writing Month” (or NaNoWriMo), a free event challenging amateur writers to compose a 50,000-word novel before December 1st. But two years ago it was suddenly joined by this companion event for artistically-inclined computer programmers, dubbed NaNoGenMo – drawing some suitably geeky jokes on Twitter. It's an irresistible challenge for a certain kind of creatively-inclined geek “This sounds like a great idea…” read one of the responses to the contest announcement. “How can anyone not take part?”---Computers Get Busy for National Novel-Generating Month

For programmers, there are many interesting things about NaNoGenMo even if no breakthroughs in AI are expected. (The point of an exercise like this isn't that it's done well, but that it's done at all. A month is not enough to build a robust system, but it is enough to experiment with and prototype new approaches for generating fifty thousand words of intelligible text. The value of a compressed time frame for experimentation is something that the participants of NaNoWriMo can well appreciate.)---Another Word:Let's Write a Story Together, MacBook

Limitations

While the generate articles' quality tends to be pretty good, there are a few problems with this demonstration that do require notice.

  1. It takes 15 seconds to generate those 17 articles. This probably isn't too bad for someone well-versed in machine learning, but it is somewhat shocking to a newbie who has to wait a while for results to pop up.

  2. The articles' sizes can vary, from being very short (1-2 paragraphs) to very long (11 paragraphs). I may have to break up some of the larger clusters so that the length of each article appear more equal.

  3. The headlines may need work. Some headlines can become pretty long (since they are using very long sentences), meaning that I tend to skip the headline half-way through and just start reading the generated article. It's likely that if I was to use this to generate my own articles, I would probably just handwrite my own headlines.

    One headline, in particular, was really odd - "For instance, here is Mr". This oddity can be traced to ClassifierReborn's approach to summarization - when it breaks an article up into individual sentences, it uses periods to indicate when a sentence has ended. "Mr.", obviously, had a period.

Randomization

ZombieWriter can also generate news articles by using randomization. Instead of bothering with machine learning to create clusters, ZombieWriter can just randomly pick paragraphs to put in each cluster. Using randomization in ZombieWriter is really easy:


zombie = ZombieWriter::Randomization.new

I wanted to test whether machine learning would produce better articles than randomization, so I also used the same nanogenmo.csv file with ZombieWriter::Randomization. Here's a Markdown file with 17 articles, using random generation.

The articles generated using machine learning appear to be of slightly higher quality than the articles generated using randomization. This may be because the articles generated using machine learning are likely to appear more 'coherent', as the paragraphs within the article usually discuss the same themes.

However, randomization is faster than machine learning, able to generate 17 articles in less than a second. Randomization also relies on less lines of code, meaning that it's much easier for a programmer to maintain.

That being said, the increased quality might outweigh the slower speed and the higher maintenance cost. So machine learning outweighs randomization, for now.

Future Research

There are certainly issues that have to be fixed, such as fixing bugs, writing unit tests, and porting the library over to Python. And that type of maintenance work matters a lot. There's also the need to explore new ways to improve the output of ZombieWriter, so that the generated texts becomes "production-ready" -- able to used in a wide variety of different industries.

Obviously, it's much easier to write about what should be done rather than to actually do it. Still, here's some ideas I'm interested in:

  • Gather the Source Data More Efficiently - ZombieWriter needs a CSV file containing the paragraphs it needs to remix as it sees fit. I was able to manually gather 90 paragraphs, but it was a tedious and time-consuming process. For ZombieWriter to successfully scale, we need to either pull paragraphs from an API (Reddit/Hacker News comments?) or write a new script to search online for news articles and extract paragraphs from these articles. Care must be taken to ensure that we provide proper attribution to whoever wrote the content.

    • For those who prefer to generate fiction instead of nonfiction, you could use a neural network to generate a bunch of nonsense text, and use that as the source data for ZombieWriter. ZombieWriter will then arrange the nonsense text into different clusters, and since the nonsense text in each cluster will share certain similarities, the resulting prose might appear more coherent.
  • Generate Better Headlines - Google experimented with headline generation using neural networks and blogged about their experience on August 24th 2016. What makes these generated headlines interesting is that they are attempting to engage in "abstractive summarization" (rephrasing the article) instead of "extractive summarization" (using exact words from the article). Their headlines therefore read more fluently than my current approach.

  • Rewrite The Generated Articles - Right now, ZombieWriter is simply quoting from other people. But while human writers do rely on quotes from other people, they don't usually just regurgitate the quotes. They want to usually restate what other people say "in their own words". Could ZombieWriter behave in the same manner? This is actually a "machine translation" problem -- the goal is to translate an existing work into a new 'manner of speaking'. People interested in this problem may want to look at Tensorflow-Shakespeare, which is a neural network that can translate modern English to Shakespearean English, and vice versa. A similar type of neural network may be useful here as well.

  • Explore Different Ways of Determining Similarities and Clustering - For example, probabilistic latent semantic analysis may produce much better results than normal Latent Semantic Analysis. Note though that this task is of a lower priority though, since any quality improvement that might be gained by improving Step 1 (similarity determination) and Step 2 (clustering) might be fairly limited and not worth the effort.

    • One interesting idea that I heard of recently is to "cluster" paragraphs based on dissimilarity instead of similarity - to take radically different paragraphs and place them right next to each other. Presenting various opinions on a topic may be a good way to counteract the "filter bubble" and introduce new ideas to the readers.

Of the ideas I'm interested in, "Gather the Source Data More Efficiently" is the highest priority. The best way to improve a text generator is to expand its corpus, after all. After acquiring enough data though, it will be time to look at ways of utilizing that data more effectively.

Appendix

Unlike my previous blog posts on computer-generated literature, this blog post is human-generated. This may be due to its nature as a technical "how-to" tutorial, where the order of paragraphs matters a lot. While text generation is slowly advancing, it still has a long way to go.

Ebook on Text Generation

In the past, every time I wrote about text generation, I would provide links to all my previous articles on dev.to. But this section just grew larger and larger. Using the tagging system wouldn't really work since the first two articles were written before the introduction of the tagging system.

Instead of providing more and more links, I created an eBook that contained all my articles...as well as a link to the original sources. The eBook ("An Introductory Guide to Computer-Generated Works") is hosted on GitBook. It provides links to the various articles I have ever written on computer-generated texts, as well as the full text of those articles if you like to use GitBook's UI or want to read these articles offline. You may download/star/subscribe to the eBook using this link.

I hope you like it.

Posted on by:

tra profile

Tariq Ali

@tra

Tariq Ali is an aspiring web developer dedicated to producing interesting applications and websites for businesses. Before entering into web development, Tariq served as a journalist, real estate agent, and college instructor. He has a special talent in writing and researching effectively.

Discussion

markdown guide
 

Okay, you find similar paragraphs from different texts with a reasonable approach. But the title is really misleading: It's not text generation when you just concatenate those paragraphs of different texts. The article is well written, but what you're doing is quite questionable, i.e. you also don't really seem to understand what is happening there.

 

I appreciate your concern.

When I say "text generation", I'm referring to all processes by which a computer is able to write out text at least semi-autonomously (so not just typing words out on a keyboard). I admit that this is an exceedingly broad definition (as it includes Markov chains, neural networks, world models, Mad-Lib templates, etc., etc.), but I like this "big tent" definition because trying to get computers to write human-readable text is a very difficult task regardless of the method chosen.

I may want to host a discussion on dev.to on what should be the proper definition of text generation though, and whether the definition should be made more restricted to avoid people getting misled, or if the "big tent" definition should be retained. I actually like your term "concatenate" to describe certain approaches to text generation, since it makes it clear that the computer isn't being "creative" in writing out this text, but simply stitching together other people's texts.

As for "don't really seem to understand what is happening there", it is true that I may have made a bad assumption. My reason for declaring that the program generated news articles is that the final text that has been generated/concatenated appears very similar to that of a news article. Humans has a tendency to engage in have a tendency to see patterns, even when none exist -- apophenia. So if humans see something that is similar to that of a news article, they will treat it as if it was a news article. At least, that was my assumption.

This was a bad assumption, as every human is different and just because I view the text as an article doesn't mean that every human will view the text as an article. Feedback from the project seems to be very positive (suggesting that most people did treat the text as articles), but one person has pointed out that they saw the generated/concatenated text as a collage instead of an article. This suggest that I need to be more cautious in the future, and if I am to make assumptions about humans, I need to study human psychology in depth first.

It is very important to prevent people (including me) from inadvertently hyping up technology, thereby leading to a hype cycle and an inevitable AI Winter. Therefore, I really appreciated the feedback that you given me and will definitely implement it in my next experiment. Thanks.