Abayomi Ogunnusi


Extract texts from PDFs.

Hiring managers and human resources (HR) teams are responsible for finding, screening, recruiting, and training job applicants, as well as administering employee-benefit programs.
At times, this work requires extracting applicants' information from documents in an automated way.

In this short post, we'll learn how to extract text from a PDF using the pdf-parse npm library.

Setup

Run npm init -y to start your Node project
Run npm i pdf-parse to install the library
Add your PDF file (test.pdf in this post) to the project folder

Your project folder should now contain at least index.js (the script we write next), test.pdf, package.json, and the node_modules folder.

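  • Here's the code (index.js)
const fs = require("fs");
const pdfParse = require("pdf-parse");

// Read the whole PDF into a Buffer
const pdfFile = fs.readFileSync("test.pdf");

pdfParse(pdfFile)
  .then(function (data) {
    console.log(data.numpages); // number of pages
    console.log(data.text); // extracted text
    console.log(data.info); // PDF info
  })
  .catch(function (err) {
    // Reached when the file is not a valid PDF or parsing fails
    console.error("Failed to parse PDF:", err.message);
  });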

  • Other fields available on the result object
    // number of pages
    console.log(data.numpages);
    // number of rendered pages
    console.log(data.numrender);
    // PDF info
    console.log(data.info);
    // PDF metadata
    console.log(data.metadata); 
    // PDF.js version
    // check https://mozilla.github.io/pdf.js/getting_started/
    console.log(data.version);
    // PDF text
    console.log(data.text); 
Run your code with this command: node index

Result:
[Screenshot: terminal output]

The 2 highlighted in green is the number of pages in the PDF (data.numpages), the first value our code logs.
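
If you only need the first page or two, pdf-parse also accepts an options object as a second argument; its max field limits how many pages are parsed (0, the default, means all pages). Below is a minimal sketch using async/await. The names extractText and maxPages are illustrative assumptions, and the parse parameter is a hypothetical hook that lets you pass a stand-in parser when testing; none of them are part of the library.

```javascript
// Sketch: async/await wrapper around pdf-parse using the `max` option.
// extractText, maxPages and the `parse` hook are illustrative names.
async function extractText(buffer, { maxPages = 0, parse = require("pdf-parse") } = {}) {
  // max: 0 (the library default) parses every page
  const data = await parse(buffer, { max: maxPages });
  return {
    pages: data.numpages, // page count of the document
    text: data.text.trim(), // extracted text without surrounding whitespace
  };
}

// Usage (assumes test.pdf sits next to this script):
// const fs = require("fs");
// extractText(fs.readFileSync("test.pdf"), { maxPages: 1 })
//   .then((result) => console.log(result.pages, result.text))
//   .catch((err) => console.error("Could not parse PDF:", err.message));
```

pdf-parse is only loaded when no parse function is passed in, so the wrapper can be tried with a stub before the library is installed.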


Basic Usage with HTTP

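We will install two additional packages, express and multer: npm i express multer

const express = require("express");
const pdf = require("pdf-parse");
const multer = require("multer");

// With no options, multer keeps uploaded files in memory (req.file.buffer)
const upload = multer();

const app = express();
const port = process.env.PORT || 3434;

// Body parser middleware
app.use(express.json());
app.use(express.raw());


app.post("/upload-pdf", upload.single("file"), (req, res) => {
  // Reject requests that do not include a file in the "file" field
  if (!req.file) {
    return res.status(400).send({ error: 'No file uploaded in the "file" field' });
  }

  console.log(`Request File: ${req.file.originalname} (${req.file.size} bytes)`);

  pdf(req.file.buffer)
    .then((data) => {
      // PDF text
      console.log(data.text);
      res.send({ pdfText: data.text });
    })
    .catch((err) => {
      // Parsing fails for corrupt or non-PDF uploads
      res.status(422).send({ error: `Could not parse PDF: ${err.message}` });
    });
});

app.listen(port, () => {
  console.log(`app started on localhost:${port}`);
});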
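
The route above accepts any file of any size. One way to tighten it, sketched here under assumptions, is to give multer a fileFilter and a size limit. The names pdfOnly, createUpload and MAX_PDF_BYTES are hypothetical, and the 5 MB cap is an arbitrary value chosen for illustration.

```javascript
// Sketch: restrict uploads to PDFs and cap their size.
// MAX_PDF_BYTES, pdfOnly and createUpload are illustrative names.
const MAX_PDF_BYTES = 5 * 1024 * 1024; // arbitrary 5 MB cap

// multer calls fileFilter(req, file, cb) once per uploaded file
function pdfOnly(req, file, cb) {
  if (file.mimetype === "application/pdf") {
    cb(null, true); // accept the file
  } else {
    cb(new Error("Only PDF files are accepted")); // reject with an error
  }
}

// multer is loaded lazily so the filter can be exercised on its own
function createUpload(multer = require("multer")) {
  return multer({
    storage: multer.memoryStorage(), // keep the file in memory so req.file.buffer is set
    limits: { fileSize: MAX_PDF_BYTES },
    fileFilter: pdfOnly,
  });
}

// Usage: const upload = createUpload(); then upload.single("file") as before.
```

Note that mimetype is reported by the client, so this check filters out honest mistakes rather than guaranteeing the bytes are a PDF; parsing can still fail and should still be handled.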


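Let's test with Postman: send a POST request to http://localhost:3434/upload-pdf with a form-data body whose file field (type File) holds the PDF.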


Result: the response is a JSON object whose pdfText field contains the extracted text.
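
As an alternative to Postman, the same request can be sent from a script. This is a rough sketch that assumes Node 18 or newer (for the built-in fetch, FormData and Blob) and a server listening on the default port 3434; buildUploadForm and uploadPdf are hypothetical helper names.

```javascript
// Sketch: send a PDF to the /upload-pdf route from a script (Node 18+).
// buildUploadForm and uploadPdf are illustrative helper names.
function buildUploadForm(buffer, filename) {
  const form = new FormData();
  // The field name must match upload.single("file") on the server
  form.append("file", new Blob([buffer], { type: "application/pdf" }), filename);
  return form;
}

async function uploadPdf(buffer, filename, url = "http://localhost:3434/upload-pdf") {
  const response = await fetch(url, {
    method: "POST",
    body: buildUploadForm(buffer, filename),
  });
  if (!response.ok) {
    throw new Error(`Upload failed with status ${response.status}`);
  }
  const body = await response.json();
  return body.pdfText; // the field the route sends back
}

// Usage (assumes the server is running and test.pdf exists):
// const fs = require("fs");
// uploadPdf(fs.readFileSync("test.pdf"), "test.pdf").then(console.log);
```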

Discuss

What other ways can you use to extract text from a PDF besides the one shown here?

Resources

pdf-parse
Dev Odyssey


Top comments (1)

sara john

I tried using it in a next.js api route and it wouldn't even load when I wrote "const pdfParse = require("pdf-parse");".

This library was last updated 5 years ago. You should delete this article. snyk.io/advisor/npm-package/pdf-parse
