DEV Community

X-Byte Enterprise Crawling

What is NLP and What is Abstractive & Extractive Text Summarization

Introduction

Summarization is one of the most common tasks in Natural Language Processing (NLP). With billions of users producing new content every day, we are flooded with a huge amount of data. Humans can only absorb a limited amount of it, so we need a way to separate the wheat from the chaff and find the information that matters most. Text summarization helps achieve this for textual data: we can separate the signal from the noise and act on what is important.

In this blog, we explore different ways of applying this task and share a few things we have learned about NLP along the way. We hope it will be useful for anyone who wants to add basic summarization to a data science pipeline to solve business problems.

Python offers some outstanding modules and libraries for text summarization. In this blog, we give a simple example of extractive summarization with Gensim and of summarization with a pre-trained HuggingFace Transformers model.

How to Utilize Summarization?

It is tempting to summarize every text in order to pull out the useful information and spend less time reading. However, NLP summarization has so far been a successful use case only in some areas.

Text summarization works very well when the text contains many raw facts and the goal is to filter out the important ones. NLP models can condense long documents into short, simple sentences. Fact sheets, news articles, and mailers fall into this group.

However, for texts in which every sentence builds on the previous one, text summarization may not work as well. Medical texts and research journals are good examples of texts where summarization may not be very successful.

Finally, in the case of fiction, summarization methods may work well. However, they can miss the style and tone that the author tried to express.

Therefore, text summarization is useful only in certain use cases.

Types of Summarization

There are two types of text summarization: extractive and abstractive.

Extractive

Extractive summarization works like this: it takes the text, ranks the sentences by how relevant they are to the text as a whole, and gives you the most important ones.

This technique does not create new words or phrases; it only takes words and phrases that are already in the text and presents them. You can picture it as taking a page of text and marking the most significant sentences with a highlighter.
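The highlighter idea can be sketched in a few lines of plain Python. This is only a minimal illustration, not the method used by any particular library: the scoring rule (average word frequency), the tiny stop-word list, and the function names `split_into_sentences`, `tokenize`, and `highlight` are all assumptions made for this example.

```python
import re
from collections import Counter

# Small hand-picked stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it",
              "as", "on", "for", "by", "that", "this", "with", "was", "were", "are"}


def split_into_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercased words of a sentence with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower())
            if w not in STOP_WORDS]


def highlight(text, num_sentences=2):
    """Return the top-scoring sentences, unchanged and in their original order."""
    sentences = split_into_sentences(text)
    word_freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(word_freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return [sentences[i] for i in keep]


sample = ("Huawei sold more phones than Samsung. "
          "The weather was pleasant. "
          "Samsung phones and Huawei phones compete in many markets.")
print(highlight(sample, num_sentences=2))
```

Every sentence in the result is copied verbatim from the input, which is what makes the approach extractive. The naive splitter would break on abbreviations such as "No. 1", so a real pipeline would need a more careful sentence tokenizer.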

Abstractive

Abstractive summarization, on the other hand, tries to infer the meaning of the whole text and presents that meaning to you.

It creates new words and phrases, puts them together in a meaningful way, and includes the most important facts found in the text. This makes abstractive summarization methods more complex than extractive ones, and also more computationally expensive.

Comparison: Extractive Summarization and Abstractive Summarization
The best way to illustrate the two types is with an example. We ran the input text below through both kinds of summarization, and the results follow.

Input Text

Huawei of China has overtaken Samsung Electronics as the world's largest seller of mobile phones in the second quarter of 2020, selling 55.8 million devices compared with Samsung's 53.7 million, according to data from the research company Canalys. Although Huawei's sales fell 5% from the same quarter a year earlier, Samsung of South Korea saw a bigger drop of 30% because of disruption from the coronavirus in key markets such as Brazil, the US, and Europe. Huawei's overseas shipments fell 27% in Q2 from a year earlier, but the company increased its dominance of the Chinese market, which has been faster to recover from COVID-19 and where it now sells more than 70% of its phones. However, Huawei's position as the No. 1 seller may prove short-lived once other markets recover, said a senior Huawei employee.

Extractive Summarization Result

Although Huawei's sales fell 5% from the same quarter a year earlier, Samsung of South Korea posted a larger drop of 30% because of disruption from the coronavirus in key markets such as Brazil, the US, and Europe, according to Canalys. Huawei's overseas shipments fell 27% in Q2 from a year earlier, but the company increased its dominance of the Chinese market, which has been quicker to recover from COVID-19 and where it now sells more than 70% of its phones.

Abstractive Summarization Result

Huawei has overtaken Samsung as the world's largest seller of mobile phones in the second quarter of 2020. Huawei sold 55.8 million devices, compared with 53.7 million sold by Samsung of South Korea. Overseas shipments fell 27% in Q2 from a year earlier, but the company increased its dominance of the Chinese market. However, its position as the No. 1 seller may prove short-lived once other markets recover, according to a senior Huawei employee.
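One rough way to tell the two kinds of output apart is to count how many word pairs in a summary never occur in the source text. The sketch below is only an illustration under our own assumptions: the helper names `ngrams` and `novel_ngram_ratio` and the choice of bigrams are hypothetical and do not come from any library used in this blog.

```python
import re


def ngrams(text, n=2):
    """All runs of n consecutive lowercased words in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def novel_ngram_ratio(source, summary, n=2):
    """Fraction of the summary's n-grams that never occur in the source."""
    source_ngrams = set(ngrams(source, n))
    summary_ngrams = ngrams(summary, n)
    if not summary_ngrams:
        return 0.0
    novel = sum(1 for g in summary_ngrams if g not in source_ngrams)
    return novel / len(summary_ngrams)


# Toy data invented for this example.
source = "Huawei sold 55.8 million phones. Samsung sold 53.7 million phones."
extractive = "Huawei sold 55.8 million phones."
abstractive = "Huawei outsold Samsung."

print(novel_ngram_ratio(source, extractive))   # copied sentence: no new word pairs
print(novel_ngram_ratio(source, abstractive))  # reworded sentence: all pairs are new
```

A summary built only from copied sentences scores 0, while a reworded summary scores higher. Real abstractive models usually land somewhere in between, because they reuse many phrases from the source.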

Extractive Text Summarization with Gensim
First, we import the functions we need. Note that the gensim.summarization module was removed in Gensim 4.0, so this example requires a Gensim 3.x release:

from gensim.summarization.summarizer import summarize
from gensim.summarization.textcleaner import split_sentences

We store the input text shown above in a variable named Input and pass it to the summarize function. The second parameter is the ratio of the original text that the summary should keep. We chose 0.4, so the summary contains roughly 40% of the original sentences.

summarize(Input, 0.4)

Output

Although Huawei's sales fell 5% from the same quarter a year earlier, Samsung of South Korea posted a larger drop of 30% because of disruption from the coronavirus in key markets such as Brazil, the US, and Europe, according to Canalys. Huawei's overseas shipments fell 27% in Q2 from a year earlier, but the company increased its dominance of the Chinese market, which has been quicker to recover from COVID-19, and it now sells more than 70% of its phones in that market.

With the parameter split=True, the output is returned as a list of sentences instead of a single string.

Gensim's summarizer is based on the TextRank algorithm. TextRank ranks the sentences of a text and returns the most significant ones.
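To give a feel for how TextRank ranks sentences, here is a simplified sketch in plain Python. It is not Gensim's implementation, which differs in its preprocessing and similarity measure. The function names, the word-overlap similarity, and the defaults for `damping` and `iterations` are assumptions chosen for this illustration. Each sentence is a node in a graph, edges are weighted by word overlap, and a PageRank-style update is repeated so that sentences similar to many other sentences receive high scores.

```python
import math
import re


def split_into_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    """Shared-word count, normalised by the log lengths of both sentences."""
    if not words_a or not words_b:
        return 0.0
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if denominator == 0:
        return 0.0
    return len(set(words_a) & set(words_b)) / denominator


def textrank_summarize(text, ratio=0.4, damping=0.85, iterations=50):
    """Return roughly `ratio` of the sentences, ranked by a TextRank-style score."""
    sentences = split_into_sentences(text)
    n = len(sentences)
    if n == 0:
        return []
    words = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]

    # Weighted, undirected graph: weights[i][j] is the similarity of sentences i and j.
    weights = [[similarity(words[i], words[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]

    # Repeated weighted PageRank update (power iteration).
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]

    keep_count = max(1, round(n * ratio))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep_count]
    return [sentences[i] for i in sorted(top)]


# Toy text invented for this example; the third sentence is unrelated to the rest.
toy_text = ("Huawei sold more phones than Samsung in the second quarter. "
            "Sales of Samsung phones dropped in the second quarter. "
            "Cats sleep quietly. "
            "Huawei increased sales of phones in China.")
print(textrank_summarize(toy_text, ratio=0.5))
```

The unrelated sentence shares no words with the others, so it receives no score from its neighbours and is left out of the summary. As with the `ratio` argument above, the output length is controlled by the share of sentences to keep.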

Abstractive Text Summarization with Huggingface Transformers
Here we summarize the same text, but this time with a transformer model from Huggingface:

from transformers import pipeline

Next, we load a pre-trained summarization model into the pipeline:

summarizer = pipeline("summarization")

To use the model, we pass it the text together with minimum and maximum length parameters (both measured in tokens) and get the following output:

summarizer(Input, min_length=30, max_length=300)

Output

Huawei of China has overtaken Samsung Electronics to become the world's largest seller of mobile phones in the second quarter of 2020, selling 55.8 million devices compared with 53.7 million from Samsung. Samsung has seen a larger drop of 30% due to disruption from the coronavirus in key markets such as Brazil, the US, and Europe.

Where Can You Get Data?

You can scrape news websites to get data for trying out both summarization methods. If you are not interested in building web scrapers to gather the data, you can try the News Data Scraping API from X-Byte.
