Techniques for Compressing PDF Files

#qpdf #ghostscript #pdfcompression #tutorial

PDF is a file format used to present documents (including text and images) in a manner independent of the software application used to view them. Because images can be embedded in a PDF document, its size can grow very large.

When receipts and other documents are scanned to PDF without OCR processing, the pages are stored as images rather than text, which increases the overall size of the document.

To help optimize the document, we will use Ghostscript and qpdf to produce a smaller version of it. The Ghostscript settings used here target an image resolution of 300 dpi.

NOTE: This tutorial assumes you have some knowledge of Docker and Node.js.

Environment Setup

To keep our application contained, we will be using Docker to package it.

First, we create a project folder and add two files to it: Dockerfile and index.js. You should have something similar to this structure.

${project-dir}
├── Dockerfile
├── index.js

We will be using a basic Node.js Alpine image for this.

FROM node:16.15-alpine

RUN apk add --update alpine-sdk

RUN mkdir -m 755 /home/node/application
COPY . /home/node/application
WORKDIR /home/node/application

CMD ["node", "index.js"]

Write the script

Since we will be using command-line utilities, we will use Node.js' built-in child_process module to execute our commands.

const { exec } = require('child_process');

function compressFile() {
  return new Promise((resolve, reject) => {
    exec('command', (error, stdout, stderr) => {
      if (error) {
        return reject(error);
      }
      if (stderr) {
        return reject(stderr);
      }
      return resolve('Done');
    });
  });
}

compressFile();
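
One thing to keep in mind with the skeleton above: exec runs the command through a shell, and some tools write warnings to stderr even when they succeed. As a rough sketch (the run helper is a hypothetical name, not part of the original script), the same idea can be written with execFile, which takes the arguments as an array and treats only a failed start or a non-zero exit code as an error:

```javascript
const { execFile } = require('child_process');

// Hypothetical helper (not from the original script): run a binary with an
// argument array, so no shell is involved and file names need no quoting.
function run(binary, args) {
  return new Promise((resolve, reject) => {
    execFile(binary, args, (error, stdout, stderr) => {
      if (error) {
        // error is set when the process cannot start or exits with a non-zero code
        return reject(error);
      }
      // stderr is returned rather than rejected, since it may only hold warnings
      return resolve({ stdout, stderr });
    });
  });
}

// Usage: the current Node.js binary stands in for gs or qpdf here
run(process.execPath, ['-e', 'console.log("ok")'])
  .then(({ stdout }) => console.log(stdout.trim()))
  .catch((err) => console.error(err.message));
```

Whether to reject on any stderr output, as the skeleton does, or only on a non-zero exit code is a design choice; the stricter version is safer when warnings should never go unnoticed.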

Solution 1 - Ghostscript

First we will need to modify our Dockerfile to add the ghostscript binary by adding a new line.

RUN apk add --no-cache ghostscript

Your Dockerfile should now look similar to this.

FROM node:16.15-alpine

RUN apk add --no-cache python3 py3-pip
RUN apk add --no-cache ghostscript
RUN apk add --update alpine-sdk

RUN mkdir -m 755 /home/node/application
COPY . /home/node/application
WORKDIR /home/node/application

CMD ["node", "index.js"]

To convert our files, here is the command we will run against the Ghostscript binary.

gs \
  -sDEVICE=pdfwrite \
  -dCompatibilityLevel=1.5 \
  -dPDFSETTINGS=/printer \
  -dNOPAUSE \
  -dBATCH \
  -dQUIET \
  -sOutputFile=output.pdf \
  input.pdf

Command Breakdown

-sDEVICE=pdfwrite selects which output device Ghostscript should use. We are compressing a PDF file so we will be using pdfwrite. See this page for other options.
-dCompatibilityLevel=1.5 generates a PDF version 1.5. Here's a list of all PDF versions.
-dPDFSETTINGS=/printer sets the image quality for printers. For additional compression choose /screen. Printer has a dpi of 300, while screen has 72.
-dBATCH and -dNOPAUSE make Ghostscript process the input file(s) without interaction and exit when completed.
-dQUIET mutes routine information comments on standard output.
-sOutputFile=output.pdf sets the path where the compressed file is stored.
input.pdf is the path of the file to process.
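
The options above can also be assembled programmatically. The following is a minimal sketch, assuming a hypothetical buildGsArgs helper that is not part of the original script; the preset names are the values Ghostscript documents for -dPDFSETTINGS.

```javascript
// Presets documented for Ghostscript's -dPDFSETTINGS option
const PRESETS = ['screen', 'ebook', 'printer', 'prepress', 'default'];

// Hypothetical helper: assemble the Ghostscript arguments described above
function buildGsArgs(input, output, preset = 'printer') {
  if (!PRESETS.includes(preset)) {
    throw new Error(`Unknown preset: ${preset}`);
  }
  return [
    '-sDEVICE=pdfwrite',
    '-dCompatibilityLevel=1.5',
    `-dPDFSETTINGS=/${preset}`,
    '-dNOPAUSE',
    '-dBATCH',
    '-dQUIET',
    `-sOutputFile=${output}`,
    input,
  ];
}

// Usage: print the full command line for the /screen preset
console.log(['gs', ...buildGsArgs('input.pdf', 'output.pdf', 'screen')].join(' '));
```

Keeping the preset as a parameter makes it easy to try /screen and /printer against the same file and compare the results.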

You can read the docs to see other available options. For our use case, we will be using the options listed above.

...
  const fileName = 'sample.pdf';
  const fileIn = `./${fileName}`;
  const fileOut = `-sOutputFile=./compressed__${new Date().toISOString()}__${fileName}`;
  const command = `gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 -dPDFSETTINGS=/printer -dNOPAUSE -dBATCH -dQUIET ${fileOut} ${fileIn}`;
...

After execution, the output file name will include the compressed prefix and the date string, to differentiate the compressed file from the original file.
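
The naming scheme can be pulled into a small function. This is only a sketch: compressedName is a hypothetical helper, and replacing the colons in the ISO timestamp is an assumption added here because some file systems (for example on Windows hosts) do not accept colons in file names.

```javascript
const path = require('path');

// Hypothetical helper: derive the output path from the input path and a date
function compressedName(fileIn, date = new Date()) {
  const dir = path.dirname(fileIn);
  const base = path.basename(fileIn);
  // Assumption: swap ":" for "-" so the name is valid on more file systems
  const stamp = date.toISOString().replace(/:/g, '-');
  return path.join(dir, `compressed__${stamp}__${base}`);
}

// Usage with a fixed date so the result is reproducible
console.log(compressedName('./sample.pdf', new Date('2022-05-01T10:00:00.000Z')));
// prints "compressed__2022-05-01T10-00-00.000Z__sample.pdf"
```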

Your complete code should look like this.

const { exec } = require('child_process');

function compressFile() {
  return new Promise((resolve, reject) => {
    const fileName = 'sample.pdf';
    const fileIn = `./${fileName}`;
    const fileOut = `-sOutputFile=./compressed__${new Date().toISOString()}__${fileName}`;
    const command = `gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 -dPDFSETTINGS=/printer -dNOPAUSE -dBATCH -dQUIET ${fileOut} ${fileIn}`;

    exec(command, (error, stdout, stderr) => {
      if (error) {
        return reject(error);
      }
      if (stderr) {
        return reject(stderr);
      }
      console.log(`stdout: ${stdout}`);
      return resolve('Done');
    });
  });
}

compressFile();

Solution 2 - QPDF

Similar to our Ghostscript setup, we will need to add qpdf to our Dockerfile.

RUN apk add --no-cache qpdf

Your Dockerfile should now look similar to this.

FROM node:16.15-alpine

RUN apk add --no-cache python3 py3-pip
RUN apk add --no-cache qpdf
RUN apk add --update alpine-sdk

RUN mkdir -m 755 /home/node/application
COPY . /home/node/application
WORKDIR /home/node/application

CMD ["node", "index.js"]

To convert our files, here is the command we will run against the qpdf binary.

qpdf --optimize-images input.pdf output.pdf

As you can see from the qpdf options, we are explicitly asking the tool to optimize the images in our PDF file. Note that the input file comes before the output file. Next, we update our code to include the qpdf command.

...
  const fileName = 'sample.pdf';
  const fileIn = `./${fileName}`;
  const fileOut = `./compressed__${new Date().toISOString()}__${fileName}`;
  const command = `qpdf --optimize-images ${fileIn} ${fileOut}`;
...

Your complete code should look like this.

const { exec } = require('child_process');

function compressFile() {
  return new Promise((resolve, reject) => {
    const fileName = 'sample.pdf';
    const fileIn = `./${fileName}`;
    const fileOut = `./compressed__${new Date().toISOString()}__${fileName}`;
    const command = `qpdf --optimize-images ${fileIn} ${fileOut}`;

    exec(command, (error, stdout, stderr) => {
      if (error) {
        return reject(error);
      }
      if (stderr) {
        return reject(stderr);
      }
      console.log(`stdout: ${stdout}`);
      return resolve('Done');
    });
  });
}

compressFile();
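
Because qpdf takes the input path first and the output path second, it can help to build the argument list in one place. A minimal sketch, assuming a hypothetical buildQpdfArgs helper that is not part of the original script:

```javascript
// Hypothetical helper: qpdf expects the input file before the output file
function buildQpdfArgs(input, output) {
  if (input === output) {
    // Guard added for this sketch: avoid pointing both paths at the same file
    throw new Error('Input and output must be different files');
  }
  return ['--optimize-images', input, output];
}

// Usage: print the full command line
console.log(['qpdf', ...buildQpdfArgs('input.pdf', 'output.pdf')].join(' '));
```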

Test the code

First, let us build the image to bundle our code together with our chosen binary.

docker build . -t pdf-compressor

To run the command, I will mount a directory on my local machine that contains the files I would like to compress onto /home/node/application, so the code can reach them and write the compressed files to the same directory.

docker run -it -v ${PWD}:/home/node/application pdf-compressor
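
The scripts above work on a single hard-coded sample.pdf. As a possible extension, the following sketch lists every PDF in the mounted directory while skipping files that already carry the compressed__ prefix. The listPdfs name is hypothetical and not part of the original script.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: list PDFs in a directory, skipping earlier outputs
function listPdfs(dir) {
  return fs
    .readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith('.pdf'))
    .filter((name) => !name.startsWith('compressed__'))
    .sort()
    .map((name) => path.join(dir, name));
}

// Usage: show the candidates in the current working directory
console.log(listPdfs('.'));
```

Each returned path could then be passed to the compression function in turn.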

Conclusion

The gains from compression depend mostly on how many uncompressed or unoptimized images are present in the document. You can test both solutions and tweak their options until you find a combination that gives you the best result.
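
To compare the two solutions on the same file, it helps to put a number on the gain. A minimal sketch, with hypothetical helper names percentSaved and compareFiles:

```javascript
const fs = require('fs');

// Percentage of the original size that was saved (negative if the file grew)
function percentSaved(originalBytes, compressedBytes) {
  if (originalBytes <= 0) {
    throw new RangeError('originalBytes must be positive');
  }
  return ((originalBytes - compressedBytes) / originalBytes) * 100;
}

// Hypothetical helper: compare two files on disk by size
function compareFiles(original, compressed) {
  const before = fs.statSync(original).size;
  const after = fs.statSync(compressed).size;
  return { before, after, saved: percentSaved(before, after) };
}

// Usage with plain numbers: a 1000-byte file reduced to 250 bytes
console.log(percentSaved(1000, 250)); // prints 75
```

A negative result means the output is larger than the input, which can happen when the images in the document were already well compressed.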

Originally Published on BlockQueue's Blog