Verify PDF contents using Playwright and pdf2json

#playwright #testautomation #automatedtests #typescript

In this tutorial we will use Playwright in conjunction with pdf2json to validate the contents of a PDF file. This is a common task when writing end-to-end automated tests.

The PDF file used in this example is a plain, text-based PDF containing 6 pages. For simplicity, I have stored the file, pdf_sample.pdf, in the root folder of the project.

Our goals are:

validate the meta information (keywords: "Standard Fees and Charges, 003-750, 3-750") contained within the file
ensure the PDF file has exactly 6 pages
assert that the PDF file contains the text "When we may charge fees"

First, add pdf2json to your project as a dev dependency using yarn (or the npm equivalent, npm install --save-dev pdf2json):

yarn add pdf2json -D

Import pdf2json into your spec file and create the initial scaffolding for our tests:



import PDFParser from 'pdf2json';
import { test, expect } from '@playwright/test';

test.describe('assert PDF contents using Playwright', () => {
  test.beforeAll(async () => {
  })

  test('pdf file should have 6 pages', async () => {
  });

  test('contains the correct subheading text', async () => {
  });

  test('shows the correct meta information (keywords)', async () => {
  });
});

Create a simple helper function that does the heavy lifting of parsing and loading the pdf contents into a variable:



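// Wraps pdf2json's event-based API in a Promise that resolves with the parsed PDF data.
async function getPDFContents(pdfFilePath: string): Promise<any> {
  const pdfParser = new PDFParser();
  return new Promise((resolve, reject) => {
    // Reject if the parser reports an error.
    pdfParser.on('pdfParser_dataError', (errData: { parserError: any }) =>
      reject(errData.parserError)
    );
    // Resolve with the parsed data once parsing has finished.
    pdfParser.on('pdfParser_dataReady', (pdfData) => {
      resolve(pdfData);
    });

    // Start parsing; one of the two handlers above fires when it completes.
    pdfParser.loadPDF(pdfFilePath);
  });
}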

Create a variable called pdfContents, scoped within the describe block:

let pdfContents: any;

Update the beforeAll hook to read the contents of the PDF into the variable:



  test.beforeAll(async () => {
    pdfContents = await getPDFContents('./pdf_sample.pdf');
  })
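Before writing assertions, it can help to print a compact summary of what the parser returned. The following is a minimal sketch, not part of pdf2json: the interfaces describe only the fields this tutorial reads (Meta, Pages, Texts, R, T), summarizePdf is a hypothetical helper, and samplePdf is a hand-written stand-in for real parser output.

```typescript
// Assumed minimal shape, limited to the fields used in this tutorial.
interface PdfTextRun { T: string }        // T holds the URI-encoded text
interface PdfText { R: PdfTextRun[] }
interface PdfPage { Texts: PdfText[] }
interface ParsedPdf {
  Meta: Record<string, unknown>;
  Pages: PdfPage[];
}

// Hypothetical helper: returns a few readable lines describing the parsed PDF.
function summarizePdf(pdf: ParsedPdf, maxTextsPerPage = 5): string[] {
  const lines: string[] = [];
  lines.push(`pages: ${pdf.Pages.length}`);
  lines.push(`meta keys: ${Object.keys(pdf.Meta).join(', ')}`);
  pdf.Pages.forEach((page, pageIndex) => {
    // Only list the first few text items per page to keep the output short.
    page.Texts.slice(0, maxTextsPerPage).forEach((text, textIndex) => {
      const raw = text.R.map((run) => run.T).join('');
      lines.push(`Pages[${pageIndex}].Texts[${textIndex}]: ${raw}`);
    });
  });
  return lines;
}

// Hand-written stand-in for real parser output, for illustration only.
const samplePdf: ParsedPdf = {
  Meta: { Keywords: 'Standard Fees and Charges, 003-750, 3-750' },
  Pages: [
    {
      Texts: [
        { R: [{ T: 'Standard%20Fees' }] },
        { R: [{ T: 'When%20we%20may%20charge%20fees' }] },
      ],
    },
  ],
};

console.log(summarizePdf(samplePdf).join('\n'));
```

In a spec file you could call summarizePdf(pdfContents) once inside beforeAll and log the result, which gives you the index paths needed for the assertions without stepping through a debugger.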

If you debug and inspect the shape of pdfContents, you will notice that the first two tests are straightforward to write:



  test('pdf file should have 6 pages', async () => {
    expect(pdfContents.Pages.length, 'The pdf should have 6 pages').toEqual(6);
  });

  test('shows the correct meta information (keywords)', async () => {
    expect(pdfContents.Meta.Keywords, 'PDF keyword was incorrect').toEqual('Standard Fees and Charges, 003-750, 3-750');
  });

However, the last test (asserting that "When we may charge fees" appears in the file) is a little more convoluted. You will need to expand the Pages array and find the page where you expect the text to exist, then inspect that page's Texts array to find the text you are looking for. In our example it was the fourth entry of the Texts array on the first page, which equates to pdfContents.Pages[0].Texts[3].R[0].T.

One last complication remains: the raw text we need, "When%20we%20may%20charge%20fees", is URI-encoded. We can decode it using the decodeURI function.
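The decoding step can be tried in isolation. This small sketch uses the raw value quoted above plus a made-up value containing an encoded comma, to show where decodeURI and decodeURIComponent differ: decodeURI leaves escape sequences for reserved characters (such as commas and colons) untouched. Whether your PDF text contains such sequences depends on how pdf2json encoded it, so treat the second value as an assumption for illustration.

```typescript
// Raw value as quoted in the tutorial.
const rawHeading = 'When%20we%20may%20charge%20fees';

// Made-up value with an encoded comma (%2C), for comparison only.
const rawWithComma = 'Fees%2C%20charges';

console.log(decodeURI(rawHeading));            // When we may charge fees
console.log(decodeURI(rawWithComma));          // Fees%2C charges
console.log(decodeURIComponent(rawWithComma)); // Fees, charges
```

For a heading made only of letters and spaces, either function gives the same result. If your expected text includes punctuation, decodeURIComponent is likely the safer choice.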



  test('contains the correct subheading text', async () => {
    const rawText = pdfContents.Pages[0].Texts[3].R[0].T
    expect(decodeURI(rawText), 'The subheading text was incorrect').toEqual('When we may charge fees');    
  });
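Hard-coded indices such as Pages[0].Texts[3] break as soon as the layout of the PDF changes. As an alternative, here is a rough sketch of a search over every page. The helper names (decodePdfText, findPagesWithText) and the interfaces are hypothetical, and samplePages is a hand-written stand-in for pdfContents.Pages.

```typescript
// Assumed minimal shape of the parsed pages, limited to the fields used here.
interface EncodedRun { T: string }
interface EncodedTextItem { R: EncodedRun[] }
interface PdfPageLike { Texts: EncodedTextItem[] }

// Decode a raw text value, falling back to the raw string if it is malformed.
function decodePdfText(raw: string): string {
  try {
    return decodeURIComponent(raw);
  } catch {
    return raw;
  }
}

// Returns the zero-based indices of pages that contain a text item
// whose decoded content equals `expected`.
function findPagesWithText(pages: PdfPageLike[], expected: string): number[] {
  const matches: number[] = [];
  pages.forEach((page, pageIndex) => {
    const found = page.Texts.some(
      (item) => decodePdfText(item.R.map((run) => run.T).join('')) === expected
    );
    if (found) matches.push(pageIndex);
  });
  return matches;
}

// Hand-written stand-in for pdfContents.Pages, for illustration only.
const samplePages: PdfPageLike[] = [
  { Texts: [{ R: [{ T: 'Standard%20Fees' }] }, { R: [{ T: 'When%20we%20may%20charge%20fees' }] }] },
  { Texts: [{ R: [{ T: 'Other%20content' }] }] },
];

console.log(findPagesWithText(samplePages, 'When we may charge fees'));
```

In the test itself, this would let you write something like expect(findPagesWithText(pdfContents.Pages, 'When we may charge fees')).toContain(0). One caveat: the sketch only matches when the whole phrase sits in a single text item, and a PDF may split one visual line across several items.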


Conclusion

I have demonstrated how you can verify the contents of a PDF using Playwright and pdf2json. We worked with a very basic PDF containing only textual information. Unfortunately, pdf2json may not be able to handle more complex PDF files. YMMV 🥳🚀
