
Build a quick Summarizer with Python and NLTK

David Israwi ・ 4 min read

If you're interested in data analytics, you will find learning about Natural Language Processing (NLP) very useful. A good project for starting to learn about NLP is writing a summarizer: an algorithm that reduces a body of text while preserving its original meaning, or at least giving great insight into the original text.

There are many libraries for NLP. For this project, we will be using NLTK - the Natural Language Toolkit.

Let's start by writing down the steps necessary to build our project.

4 steps to build a Summarizer

  1. Remove stop words (defined below) from the analysis
  2. Create a frequency table of words - how many times each word appears in the text
  3. Assign a score to each sentence depending on the words it contains and the frequency table
  4. Build the summary by adding every sentence above a certain score threshold

That's it! And the Python implementation is also short and straightforward.

What are stop words?

Any word that does not add value to the meaning of a sentence. For example, let's say we have the sentence

A group of people run every day from a bank in Alafaya to the nearest Chipotle

By removing the sentence's stop words, we can narrow the number of words and preserve the meaning:

Group of people run every day from bank Alafaya to nearest Chipotle

We usually remove stop words from the analyzed text because knowing their frequency doesn't give any insight into the body of text. In this example, we removed the instances of the words a, in, and the.
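As a quick sketch of that filtering step (with a tiny hand-picked stop-word set; the article uses NLTK's much larger English list):

```python
# Toy stop-word set for illustration; NLTK's stopwords.words("english")
# is the real list used later in the article.
stop_words = {"a", "in", "the"}

sentence = "A group of people run every day from a bank in Alafaya to the nearest Chipotle"
filtered = [w for w in sentence.split() if w.lower() not in stop_words]
print(" ".join(filtered))
# group of people run every day from bank Alafaya to nearest Chipotle
```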

Now, let's start!

There are two NLTK modules that will be necessary for building an efficient summarizer.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

Note: There are more modules that can make our summarizer better; one example is discussed at the end of this article.

Corpus

A corpus is a collection of texts. It could be a data set of poems by a certain poet, bodies of work by a certain author, etc. In this case, we are going to use a data set of predetermined stop words.

Tokenizers

A tokenizer divides a text into a series of tokens. There are three main tokenizers: the word, sentence, and regex tokenizers. For this project, we will only use the word and sentence tokenizers.
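To see what tokenization produces, here is a toy illustration using plain regular expressions (NLTK's sent_tokenize and word_tokenize handle abbreviations, punctuation, and edge cases far more robustly, so this is only a sketch of the idea):

```python
import re

text = "NLTK is great. It makes NLP easy!"

# Toy sentence tokenizer: split after sentence-ending punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Toy word tokenizer: grab runs of word characters.
words = re.findall(r"\w+", text)

print(sentences)  # ['NLTK is great.', 'It makes NLP easy!']
print(words)      # ['NLTK', 'is', 'great', 'It', 'makes', 'NLP', 'easy']
```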

Removing stop words and making frequency table

First, we create two containers: a set of stop words, and a list of every word in the body of text.

Let's use text as the original body of text.

stopWords = set(stopwords.words("english"))
words = word_tokenize(text)

Second, we create a dictionary for the word frequency table. For this, we should only use the words that are not in the stopWords set.

freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

Now, we can use the freqTable dictionary over every sentence to know which sentences have the most relevant insight to the overall purpose of the text.
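As an aside, the same frequency table can be built more compactly with collections.Counter. This is a sketch with toy values; in the article, words comes from word_tokenize(text) and stopWords from stopwords.words("english"):

```python
from collections import Counter

# Toy inputs standing in for the NLTK-derived ones above.
stopWords = {"the", "a", "of"}
words = ["The", "cat", "saw", "the", "other", "cat"]

# Counter does the lowercasing + counting loop in one expression.
freqTable = Counter(w.lower() for w in words if w.lower() not in stopWords)
print(freqTable["cat"])  # 2
```

Counter is a dict subclass, so the rest of the code can use it exactly like the hand-built freqTable.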

Assigning a score to every sentence

We already have a sentence tokenizer, so we just need to run the sent_tokenize() method to create the array of sentences. Second, we will need a dictionary to keep the score of each sentence; this way, we can later go through the dictionary to generate the summary.

sentences = sent_tokenize(text)
sentenceValue = dict()

Now it's time to go through every sentence and give it a score depending on the words it has. There are many algorithms to do this - basically, any consistent way to score a sentence by its words will work. I went for a basic algorithm: adding the frequency of every non-stop word in a sentence.

for sentence in sentences:
    for wordValue in freqTable.items():
        if wordValue[0] in sentence.lower():
            if sentence[:12] in sentenceValue:
                sentenceValue[sentence[:12]] += wordValue[1]
            else:
                sentenceValue[sentence[:12]] = wordValue[1]

Note: iterating over freqTable.items() yields (word, frequency) pairs, so index 0 of wordValue returns the word itself and index 1 its number of instances. (Iterating over freqTable directly would yield only the keys, i.e., the words themselves.)

If sentence[:12] caught your eye, nice catch. This is just a simple way to key each sentence into the dictionary by its first 12 characters. Keep in mind that two sentences sharing the same first 12 characters will collide; using the whole sentence as the key avoids this.

Notice that a potential issue with our score algorithm is that long sentences will have an advantage over short sentences. To solve this, divide every sentence score by the number of words in the sentence.
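That normalization can be sketched as follows (toy scores here; in practice, sentenceValue holds the raw scores built above):

```python
# Toy raw scores: a short sentence and a longer one.
sentenceValue = {"Short one.": 6, "A much longer sentence with many words.": 14}

# Divide each raw score by the sentence's word count so long
# sentences don't win just by containing more words.
normalized = {s: score / len(s.split()) for s, score in sentenceValue.items()}
print(normalized)
# {'Short one.': 3.0, 'A much longer sentence with many words.': 2.0}
```

Note how the longer sentence had the higher raw score but the lower per-word score.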

So, what value can we use to compare our scores to?

A simple approach to this question is to find the average score of a sentence. From there, finding a threshold will be easy peasy lemon squeezy.

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

# Average value of a sentence from original text
average = int(sumValues / len(sentenceValue))

So, what's a good threshold? The wrong value could give a summary that is too small/big.

The average itself can be a good threshold. For my project, I decided to go for a shorter summary, so the threshold I use for it is one-and-a-half times the average.

Now, let's apply our threshold and store our sentences in order into our summary.

summary = ''
for sentence in sentences:
    if sentence[:12] in sentenceValue and sentenceValue[sentence[:12]] > (1.5 * average):
        summary += " " + sentence

You made it!! You can now print(summary) and you'll see how good our summary is.
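Putting the four steps together, the whole summarizer can be sketched end-to-end. This is a minimal standalone version with toy regex tokenizers and a hand-picked stop-word list standing in for NLTK, so only the shape of the algorithm is shown:

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "in"}  # toy list

def summarize(text, factor=1.5):
    # 1. Split into sentences and words (toy tokenizers; use NLTK's
    #    sent_tokenize/word_tokenize in the real version).
    sentences = re.split(r"(?<=[.!?])\s+", text)
    words = [w.lower() for w in re.findall(r"\w+", text)]

    # 2. Frequency table, skipping stop words.
    freq = {}
    for w in words:
        if w not in STOP_WORDS:
            freq[w] = freq.get(w, 0) + 1

    # 3. Score each sentence by the frequencies of its words.
    scores = {}
    for s in sentences:
        for w in re.findall(r"\w+", s.lower()):
            if w in freq:
                scores[s] = scores.get(s, 0) + freq[w]

    # 4. Keep sentences scoring above a threshold on the average.
    #    (Assumes at least one sentence received a score.)
    average = sum(scores.values()) / len(scores)
    return " ".join(s for s in sentences if scores.get(s, 0) > factor * average)
```

With the real NLTK tokenizers and stop-word list substituted in, this is the same structure the article builds step by step.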

Optional enhancement: Make smarter word frequency tables

Sometimes, we want two very similar words to add importance to the same entry, e.g., mother, mom, and mommy. For this, we use a stemmer: an algorithm that reduces words to their root form.

To implement a stemmer, we can use NLTK's stem module. You'll notice there are many stemmers; each one is a different algorithm to find the root word, and one algorithm may be better than another for specific scenarios.

from nltk.stem import PorterStemmer
ps = PorterStemmer()

Then, pass every word through the stemmer before adding it to freqTable. It is important to also stem every word when going through each sentence, before adding up the scores of the words in it.
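A sketch of that idea, with a crude suffix-stripping stemmer standing in for NLTK's PorterStemmer (which implements a far more careful algorithm):

```python
def crude_stem(word):
    # Toy stemmer for illustration only; the article uses NLTK's
    # PorterStemmer, a much more thorough suffix-stripping algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Stem before counting, so word variants land in the same entry.
freqTable = {}
for word in ["mother", "mothers", "mothering"]:
    stem = crude_stem(word.lower())
    freqTable[stem] = freqTable.get(stem, 0) + 1
print(freqTable)  # {'mother': 3}
```

The same stemming function (or ps.stem) must then also be applied to each word while scoring sentences, so the lookups hit the stemmed keys.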

And we're done!

Congratulations! Let me know if you have any other questions or enhancements to this summarizer.


Thanks for reading my first article! Good vibes

David Israwi (@davidisrawi), recent graduate from the University of Central Florida in Computer Science. Looking for entertainment!

Discussion

 

Excellent post, you are absolutely amazing ❤️
I have one question though: when adding up the sentenceValues, why would you want the key in the sentenceValue dictionary to be only the first 12 characters of the sentence? It might cause some trouble if a sentence is shorter than 12 characters or if two different sentences start with the exact same 12 characters.

I assume you did it as a way to reduce overhead, but to be honest, performance-wise I don't think the difference would be that significant. In exchange for a tiny performance increase, I would much rather have:

  • Added readability (you won't have to think about the [:12])
  • No errors with sentences shorter than 12 characters
  • No issues with two different sentences starting with the same 12 characters

I would love to hear your opinion on this matter.

If anyone got any errors running the code, copy paste my version.

That said, it does not work properly; it has some flaws. I tried to summarize this article as a test. Here is the result (the threshold is 1.5 * average):

"For example, the Center for a New American Dream envisions "... a focus on more of what really matters, such as creating a meaningful life, contributing to community and society, valuing nature, and spending time with family and friends."

 

Thank you very much, Sebastian!

I agree with you -- having the whole sentence as the dictionary key will bring better reliability to the program compared to the first 12 characters of the sentence. My decision was mainly about the overhead, but as you said, it is almost negligible. One bug I would look out for is the use of special characters in the text, mainly quotes and braces, but this is an easily fixable issue (I believe using triple quotes as you are currently doing will avoid it).

I summarized the same article and got the following summary:

It boldly proclaims: "We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. President Lincoln extended the American Dream to slaves with the Emancipation Proclamation. How the American Dream Changed Throughout U.S. history, the definition of happiness changed as well. After the 1920s, many presidents supported the idea of the Dream as a pursuit of material benefits. While running for President in 2008, Hillary Clinton proposed her American Dream Plan. Did the Great Recession Create a New American Dream? Some people think the Great Recession and rising income inequality spelled the end of the American Dream for many. Instead, many are turning to a new definition of the American Dream that better reflects the values of the country for which it was named. For example, the Center for a New American Dream envisions "... a focus on more of what really matters, such as creating a meaningful life, contributing to community and society, valuing nature, and spending time with family and friends." Financial adviser Suze Orman described the new American Dream as one "... where you actually get more pleasure out of saving than you do spending. (Source: Suze Orman on the New American Dream, ABC.) Both of these new visions reject the American Dream based on materialism. But perhaps there is no need to create a New American Dream from scratch.

Feel free to use my version for comparison!

How short your summary was may be a result of the way you are using the stemmer; I would suggest testing the same article without it to verify this. Besides that, your code is looking on point -- clean and concise. If you are looking for ways to improve your results, I would suggest exploring the following ideas:

  • Having a variable threshold
  • Using TF-IDF instead of our word value algorithm (not sure if it'll bring better results, but worth a try)
  • Having some kind of derived value carried over from the previous sentence for consistency

Thanks for the suggestion!

 

Cool website you got yourself there!

I have a question I forgot to ask: why do you turn the stopwords list into a set()? First I thought it was because you probably intended to remove duplicate items from the list, but then it struck me: why would there be duplicate items in a corpus list containing stop words? When I compared the length of the list before and after turning it into a set, there was no difference:

len(stopwords.words("english")) == len(set(stopwords.words("english")))
Outputs: True

Tracing the variable throughout the script, I must admit I cannot figure out why you turned it into a set. I assume it is a mistake?
Or do you have a specific reason for it?

  • by the way, thanks for the TFIDF suggestion, I am currently working on improving the algorithm by implementing the tfidf concept.

Hmm, I believe the first time I used the list of stop words from NLTK there were some duplicates; if not, I am curious too, lol. It may be time to change it to a list.

Thanks for the note!

If you ever try your implementation using TFIDF, let me know how it goes.

 

Excellent post!
@davidisrawi can you please help me with extracting 3-4 keywords from each paragraph of an article?

I went through your article and got stuck with an error: "string index out of range".

sentenceValue[sentence[:12]] += wordValue[1]
IndexError: string index out of range

I have tried changing [sentence[:12]] to 7, 8, and 9 but have been unable to resolve the error.

Please help regarding this

 

Thank you very much Dhairya!

This bug can happen whenever your list of sentences contains one shorter than 12 characters. A good workaround is to remove the [:12] index completely and use the whole sentence as the sentenceValue key. Does that make sense? So instead it would be:

sentenceValue[sentence] += wordValue[1]

Let me know if that fixes the problem!

 

I have changed it, but it is still giving me a KeyError and showing the first 3-4 lines of my text.

How to solve this error ?

Hm, sounds like you may have forgotten to remove the [:12] index from the other parts of your code where you use sentenceValue; maybe that is the issue? If not, feel free to share a snippet of your code so we can be on the same page.

Hi, I am still facing the same index out of range error... Can you help me with this?

The code should be like this:

for sentence in sentences:
    for wordValue in freqTable:
        if wordValue in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freqTable[wordValue]
            else:
                sentenceValue[sentence] = freqTable[wordValue]

 

Thanks @david Israwi for this simple and interesting text summarizer program.
I read and analyzed your code.
The most common error I found is "index out of range", and many people seem to run into it.
One thing I am confused about is this part of the code:

summary = ''
for sentence in sentences:
     if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.5 * average)):
          summary += " " + sentence

Why and how is the 1.5 * average threshold chosen?
And what about a large single-sentence text? It does not get summarized.
For example:

For example, the Center for a New American Dream envisions "... a focus on more of what really matters, such as creating a meaningful life, contributing to community and society, valuing nature, and spending time with family and friends and even altogether"

I am using Python 3, and I resolved the index out of range error like this:

for sentence in sentences:
    for index, wordValue in enumerate(freqTable, start=1):
        if wordValue in sentence.lower():
            if sentence in sentenceValue:
                # index stands in for the word's occurrence value
                sentenceValue[sentence] += index
            else:
                sentenceValue[sentence] = index

Thanks a lot! This post is really helpful! If you have other resources, including on making a chatbot, that would be really helpful to me.
I am a little bit interested in how to implement a text summarizer using a machine learning model; I am looking for that too...
You can send information directly to sushant1234gautam@gmail.com

 

sentenceValue[sentence[:12]] += wordValue[1]
string index out of range

 

Hey! You may have a sentence that is shorter than 12 characters. In this case, you can set the key to sentence[:10], or a lower number depending on your shortest sentence.

Lowering the number of characters used to hash the sentence value can bring some issues -- two sentences with the same 7, 8, or 9 starting characters will then store/retrieve their value from the same key in the dictionary. That's why it's important to keep the hashing length as high as you can.

 

Any number I use gives the same error:

sentenceValue[sentence[:2]] += wordValue[1]

IndexError: string index out of range

Interesting. To debug it, I would print all your sentences and check whether there's an empty one (or a very short one); I think that may be the issue.

Let me know if that works or if you find the issue.

I am also facing the string index out of range problem. What is the issue?

You may have a string that is shorter than your slice length sentence[:2]. I would recommend printing the strings and seeing if this is the case.

I have solved it. It was actually a punctuation problem in my case; I just handled the dot (.) character when giving words as values.

Could you please help me? I'm facing the same problem here and I can't handle it. Thank you.

Is this your error?

IndexError: string index out of range

If so, potential solutions could be:

  • Having a string that is less than your string length sentence[:2]. I would recommend printing the strings and see if this is the case.
  • Punctuation problem, like phabonislam explained above.

If that doesn't solve it, let me know!

It is still giving me an error even when the text is more than 12 characters long and the sentence (when printed through the loop) is "You notice a wall of text in twitch chat and your hand instinctively goes to the mouse.", which is the first line in the paragraph. I found that even when you take out the range, the same error occurs.

The bug may be in how you are storing your sentences; make sure you print out the sentences as you store them instead of when you retrieve them. Hopefully that'll help you find the issue. If not, let me know if I can help!

 

This isn't working right for me, and I think it comes down to wordValue[0] not working for me the way you said. Do you know why that could be?

Like if I do:
for wordValue in freqTable:
    print(wordValue[0])

I only get the first letters:
q
b
f
j
m
.
s
b
s
l

 

It seems like your bug comes from separating the paragraphs into letters instead of words.

The program should do the following commands in the respective order:

  • Separate the body of text into sentences using sentence tokenizer
  • Separate sentences into words using the word tokenizer
  • Find frequency of each word while building bag of words

I wouldn't be able to tell in which step the bug is, but it seems as if you are finding the frequency of each letter instead of each word. Make sure you are keeping track of your arrays by printing them throughout your code; it seems like you're almost there.

 

I too am confused about this. From my understanding, when we use the in operator on a dictionary, it only iterates through the keys, which would explain why the program prints only the first letter.

To get the key-value pairs, we need to use the items() method:
for wordValue in freqTable.items():

 

Great post David,

I have been trying to wrap my head around machine learning and NLP for a few months now. Developing intuition has been a slow process. Articles like yours are a source of "aha moments". I am trying to build a blog-post summary app. Being a newbie, I am using an API (AYLIEN) and following this summary generator tutorial. Having something working gives me motivation to read in-depth articles.

 

Thanks for your comments Vikram, best of luck with the summary app!

 

In Python, a string is a one-dimensional array of characters. "String index out of range" means that the index you are trying to access does not exist: you are trying to get a character at a position that is not inside the string. Indexes in Python start at 0, so the maximum index for any string will always be length - 1. There are several ways to account for this; knowing the length of your string (using the len() function) can certainly help you avoid going past the end.

 

Hello sir, could you suggest a way to make the summarizer more accurate? Sometimes a few sentences with lower sentence values can be very important for the summary; if we leave those out, the summary may not make sense.

 

That's a good point. I think what you might be referring to is some kind of adjacency value: this sentence might be worth more than we think because it's next to a really important sentence.

Another aspect you could change in the scoring algorithm is the use of TF-IDF. Let me know if you end up using it; I would like to see how that would look.

 

I'm getting a "syntax error" on any text that I try to pass through the program. How would I go about running the text I want to summarize through this program?

 

I would try converting the text to UTF-8 before sending it through; maybe there are special characters or accents throwing it off.

 

Hi,
I have a problem with this:

sumValues += sentenceValue[sentence]
TypeError: unsupported operand type(s) for +=: 'int' and 'str'

 

Hi Viqi. Seems like you are storing a string in your sentenceValue dictionary instead of an actual value; it is supposed to be an int. Fixing that should solve the problem!

 

Where do we import the text file to run this?

 

That would be up to you!

In my implementation, I put everything in one method, so I can just run it through the command line, passing the actual string of text. Having said that, it totally depends on your use case or implementation; in some cases it might be worth receiving the text file instead.

Here is my imp if you want to take a look at it: github.com/DavidIsrawi/SummarizeMe...

 

Hi,

I'm pretty new to coding in general and am writing my thesis on the use of text summarization in the marketing context.
First: Thanks for this awesome explanation, it helped me a lot!!
My question is: how does this differ from the Luhn algorithm? It seems to me that there is not such a big difference. Or did I confuse something?