Nowadays, large amounts of data are stored digitally and accessed by computers for analysis. Analyzing this data manually is expensive: as the number of documents grows, so does the workforce required. We therefore need better mechanisms to extract useful information quickly and effectively. Various computerized methods have been proposed to analyze documents automatically and produce summaries with less effort and at lower cost than manual summarization. Text summarization reduces the text of a document to a concise representation that contains its most important sentences or keyphrases, enabling faster data analysis.
Automatic text summarization is mainly categorized into two types:
Extractive Summarization generates the summary by including words, sentences, or key-phrases exactly as they appear in the original document, while Abstractive Summarization generates the summary from the semantic meaning of the document. It may or may not reuse words, sentences, or key-phrases from the original document, and it can also introduce new ones.
Basically, Extractive Text Summarization can be done in the following steps:
Tokenization and Preprocessing:
Tokenization is the process of breaking the document into words (word tokenization) or sentences (sentence tokenization). Next, removing stop words and applying stemming or lemmatization to reduce each word to its base form fall under preprocessing. We can use the NLTK Python library for tokenization and preprocessing. Note that before any further processing of the input data, make sure you have indexed the sentences; otherwise it will be difficult to rearrange them into their original sequence in the final summary.
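A minimal sketch of this step, assuming NLTK is installed (the `punkt` and `stopwords` resources, the Porter stemmer, and the sample document are choices made here for illustration, not prescribed by the article):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

document = (
    "Text summarization shortens a document. "
    "It keeps only the key sentences. "
    "Readers can then analyze data faster."
)

# Sentence tokenization; the position of each sentence in this list is its
# index in the original document, which we need later to reorder the summary.
sentences = sent_tokenize(document)

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

cleaned_sentences = []
for sentence in sentences:
    # Word tokenization, lowercasing, stop-word removal, and stemming.
    words = [
        stemmer.stem(w.lower())
        for w in word_tokenize(sentence)
        if w.isalnum() and w.lower() not in stop_words
    ]
    cleaned_sentences.append(" ".join(words))
```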
Vector Representation:
Text data is not numerical by itself, so we need to apply a transformation that maps it from the input space into a feature space where numerical analysis is possible. Text is high-dimensional data that can be represented in different forms, such as a word-count vector or a TF-IDF vector.
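Continuing the sketch above, one way to build both representations is with scikit-learn (an assumption on my part; the article itself only names NLTK):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Word-count vectors: each sentence becomes a row of raw term frequencies.
count_matrix = CountVectorizer().fit_transform(cleaned_sentences)

# TF-IDF vectors: term frequencies down-weighted for words that appear in
# many sentences, so distinctive words carry more weight.
tfidf_matrix = TfidfVectorizer().fit_transform(cleaned_sentences)

print(count_matrix.shape, tfidf_matrix.shape)  # (n_sentences, vocabulary_size)
```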
Sentence Scoring:
We have represented the sentences as vectors. Now we find the similarity between all the vectors, using metrics such as Jaccard similarity, Euclidean distance, or cosine similarity. Based on these similarities, we can assign a score to each sentence using algorithms such as TextRank, LexRank, etc.
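Continuing from the TF-IDF vectors, here is a TextRank-style scoring sketch using cosine similarity and PageRank; scikit-learn and networkx are assumptions made for illustration, not libraries named by the article:

```python
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between sentence vectors
# (an n_sentences x n_sentences matrix).
similarity_matrix = cosine_similarity(tfidf_matrix)

# TextRank-style scoring: sentences are graph nodes, similarities are edge
# weights, and PageRank assigns each sentence an importance score.
graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(graph)  # {sentence_index: score}
```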
Sentence Extraction:
Each sentence now has a score. Sort the sentences by score and pick the top n from the source document; these human-readable sentences form the summary. Finally, rearrange them based on their original index.
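Finishing the sketch, we pick the top-n scored sentences and restore their original order (n = 2 is an arbitrary choice for this tiny example):

```python
n = 2  # how many sentences to keep; purely illustrative

# Rank sentence indices by score, take the top n, then sort those indices
# again so the summary follows the original document order.
top_indices = sorted(scores, key=scores.get, reverse=True)[:n]
summary = " ".join(sentences[i] for i in sorted(top_indices))
print(summary)
```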