How to run NLP on a PDF file?

#nlp #node #javascript

We often need to extract information from PDF documents. The first step is to convert the PDF into raw text using a PDF parser; the following example uses the pdf-parse NPM package for this. The raw text is then passed to winkNLP's readDoc method, which returns a doc object. This object gives access to a variety of information, such as named entities, sentences containing negation, and overall or sentence-wise sentiment scores. The example here illustrates the extraction of named entities, a task also known as named entity recognition (NER).

// Load wink-nlp package & helpers.
const winkNLP = require( 'wink-nlp' );
const its = require( 'wink-nlp/src/its.js' );
const model = require( 'wink-eng-lite-model' );
const nlp = winkNLP( model );

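const fs = require( 'fs' );
const pdf = require( 'pdf-parse' );

// Read the PDF file into a buffer.
const dataBuffer = fs.readFileSync( './sample.pdf' );

// Parse the buffer, then extract entities from the resulting text.
pdf( dataBuffer ).then( function ( data ) {
  const doc = nlp.readDoc( data.text );
  console.log( doc.entities().out( its.detail ) );
} ).catch( function ( err ) {
  // Report parsing failures, e.g. a corrupt or unreadable PDF.
  console.error( 'Failed to process PDF:', err );
} );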

The above code reads the file sample.pdf from the current directory and prints all the named entities detected along with their types, such as DATE, TIME, MONEY and EMAIL. Each entity is a JavaScript object with two properties, value and type; for example, { value: 'March 15, 1972', type: 'DATE' }.
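Since every entity is a plain object with value and type, the printed array is easy to post-process. As a minimal sketch, the hypothetical helper groupEntitiesByType below (not part of winkNLP) collects the distinct values found for each type. The sample array is hand-written and only stands in for the output of doc.entities().out( its.detail ).

```javascript
// Group entity objects of the shape { value, type } by their type.
// NOTE: groupEntitiesByType is a hypothetical helper, not a winkNLP API.
function groupEntitiesByType( entities ) {
  const groups = {};
  for ( const { value, type } of entities ) {
    if ( !groups[ type ] ) groups[ type ] = [];
    // Skip repeated mentions of the same value.
    if ( !groups[ type ].includes( value ) ) groups[ type ].push( value );
  }
  return groups;
}

// Hand-written sample standing in for doc.entities().out( its.detail ).
const sampleEntities = [
  { value: 'March 15, 1972', type: 'DATE' },
  { value: 'jane@example.com', type: 'EMAIL' },
  { value: 'March 15, 1972', type: 'DATE' }
];

console.log( groupEntitiesByType( sampleEntities ) );
```

In the PDF example, the same helper could be called on the array returned by doc.entities().out( its.detail ) to list, say, every distinct date found in the document.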

winkNLP’s English language lite model uses a pre-trained state machine to recognize named entities.

This approach can help extract meaningful information from a resume, a financial document or a complete book.
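Text extracted from documents like these may carry layout artifacts, such as hard line wraps and words hyphenated across lines, which can split a date or a name into separate pieces. The sketch below shows one possible cleanup step to apply before calling readDoc. The function normalizePdfText is a hypothetical helper, not part of pdf-parse or winkNLP, and its rules are assumptions that may need tuning for a given PDF.

```javascript
// Hypothetical cleanup helper; not part of pdf-parse or winkNLP.
// Assumes paragraphs are separated by blank lines in the extracted text.
function normalizePdfText( text ) {
  return text
    // Normalize Windows line endings.
    .replace( /\r\n/g, '\n' )
    // Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    .replace( /([A-Za-z])-\n([a-z])/g, '$1$2' )
    // Turn single line breaks inside a paragraph into spaces.
    .replace( /([^\n])\n(?!\n)/g, '$1 ' )
    // Collapse runs of spaces and tabs.
    .replace( /[ \t]+/g, ' ' )
    .trim();
}

// Hand-written sample imitating wrapped PDF text.
const rawText = 'The docu-\nment was signed on\nMarch 15, 1972.\n\nNext paragraph.';

console.log( normalizePdfText( rawText ) );
```

In the PDF example, this would mean calling nlp.readDoc( normalizePdfText( data.text ) ) instead of passing data.text directly. One caveat: the de-hyphenation rule also joins genuine hyphenated compounds that happen to break at a line end, so it is worth checking the result on a sample of your own documents.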

Photo by Annie Spratt on Unsplash