
Mohsin Ashraf for Traindex


Summarizing Large Documents for Machine Learning Modeling

Information is growing exponentially with every passing day, thanks to the internet, which has connected humanity across all corners of the world. According to Forbes, 2.5 quintillion bytes of data are created every day, and the pace is accelerating. This includes all kinds of data: text, images, videos, transactions, and more. Text makes up the largest share. It can be a conversation between two people, consisting of short sentences, or it can be intellectual property data, for example patents, which can run up to millions of words.

Handling smaller datasets with fairly small to medium-length documents is no longer a big challenge, thanks to the power of deep learning. The problem arises when you have to deal with large documents ranging from a few thousand to millions of words, such as patents and research papers. Even state-of-the-art deep learning methods struggle to capture the long-term dependencies in such documents, so they require special techniques.

At Traindex, we work with intellectual property and provide effective and efficient search over patents. Patent analysts might want to know what other patents exist in the same domain when filing a new one, or find prior art to challenge a claim in an existing patent. Better patent search helps solve numerous such use cases.

We are dealing with millions of patents, each containing thousands of words, with some even reaching millions of words. Handling such a massive dataset of enormously large documents is a big challenge. These patents are not only lengthy but also intellectually dense, with long-term dependencies. Deep learning alone fails to capture the semantics of such huge documents; specialized techniques are needed. We have developed preprocessing techniques that reduce the size of the documents while keeping their meaning largely intact.

We use an extractive summarizer that first goes through the whole patent, scores the importance of each sentence, and then drops the least important ones. The summarizer uses two measures to decide whether a sentence is essential: how many stopwords it contains (which reduces its importance), and how many of the patent's overall topics it covers (which increases its importance). We then apply a simple threshold to decide which sentences to keep and which to drop. By changing the threshold, we can control the summary length so that the summary retains the most important information about the patent. The figures below illustrate this point.
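To make the two measures concrete, here is a minimal sketch of this kind of extractive scoring, not the actual Traindex implementation: the stopword list is a small illustrative sample, and "topics" are approximated as the document's most frequent non-stopword terms. A sentence's score rises with topic overlap and falls with its stopword fraction, and only sentences clearing the threshold survive.

```python
import re
from collections import Counter

# Illustrative sample; a real system would use a full stopword list.
STOPWORDS = {"the", "a", "an", "is", "are", "it", "of", "to", "in", "and", "for", "with"}

def tokenize(text):
    """Lowercase alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def top_topics(sentences, k=10):
    """Approximate the document's topics as its k most frequent non-stopword terms."""
    counts = Counter(t for s in sentences for t in tokenize(s) if t not in STOPWORDS)
    return {word for word, _ in counts.most_common(k)}

def score_sentence(sentence, topics):
    """Topic overlap raises the score; a high stopword fraction lowers it."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    stopword_frac = sum(t in STOPWORDS for t in tokens) / len(tokens)
    topic_frac = sum(t in topics for t in tokens) / len(tokens)
    return topic_frac - stopword_frac

def summarize(sentences, threshold=0.0):
    """Keep sentences whose score clears the threshold, preserving original order."""
    topics = top_topics(sentences)
    return [s for s in sentences if score_sentence(s, topics) >= threshold]
```

A sentence made almost entirely of stopwords scores negatively and is dropped, while sentences dense in the document's recurring terms are kept.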


The above image shows the scores of the sentences in a patent. The horizontal red line is the importance threshold for keeping or discarding sentences. The blue bars are sentences whose importance falls below our defined threshold, so we drop them, and the result looks as follows.


These are the sentences we keep for the summarized patent. By moving the threshold line, we can get summaries of different lengths based on our needs. The flow of the overall process is given below.

[Flow diagram of the overall summarization process]
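One hedged way to turn a target summary length into a threshold, rather than hand-tuning it, is to set the threshold at a score percentile. This is a sketch of that idea, not the method the article necessarily uses; it assumes you already have one importance score per sentence.

```python
import math

def threshold_for_ratio(scores, keep_ratio):
    """Pick a threshold so roughly `keep_ratio` of sentences score at or above it.

    E.g. keep_ratio=0.3 keeps about the top 30% of sentences.
    """
    if not scores:
        return 0.0
    n_keep = max(1, math.ceil(len(scores) * keep_ratio))
    return sorted(scores, reverse=True)[n_keep - 1]

# Usage: keep roughly the top half of sentences.
scores = [0.8, 0.1, 0.5, 0.05, 0.9, 0.3]
t = threshold_for_ratio(scores, keep_ratio=0.5)
kept = [s for s in scores if s >= t]
```

Raising `keep_ratio` lowers the threshold and lengthens the summary; lowering it does the opposite, which matches the "move the threshold line" behavior described above.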

We have tested this approach, and it has improved the overall performance of our patent search index. It also sidesteps a practical problem for deep learning models: pre-trained encoders such as the Universal Sentence Encoder or BERT accept only a limited number of tokens per document, and exceeding that limit causes errors. You can apply this summarization technique with any embedding algorithm that constrains the input document's length.
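When a fixed model input limit is the constraint, the same scores can drive the cut directly: drop the lowest-scoring sentences until the text fits a word budget. This is an illustrative sketch under that assumption (one importance score per sentence, a simple whitespace word count standing in for the model's real subword tokenizer), not the production pipeline.

```python
def shrink_to_limit(sentences, scores, max_words):
    """Greedily keep the highest-scoring sentences that fit within `max_words`,
    then return the survivors in their original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept, budget = set(), 0
    for i in ranked:
        n = len(sentences[i].split())
        if budget + n > max_words:
            continue  # this sentence would overflow the budget; try shorter ones
        kept.add(i)
        budget += n
    return [sentences[i] for i in sorted(kept)]
```

Note that real encoders like BERT count subword tokens, which usually outnumber whitespace words, so a budget chosen this way should leave some headroom below the model's actual limit.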
