Building a Client-Side PDF Compressor using JavaScript and Web Workers

When we started building PDF tools, the default architectural choice was obvious: upload the file to a backend (Python/Node), wrap a CLI tool like Ghostscript, process it, and send it back.

But that approach has three massive downsides:

Latency: Uploading a 50MB PDF just to shave off 10MB takes too long.

Privacy: Users are increasingly skeptical about uploading sensitive documents to unknown servers.

Cost: Processing PDFs is CPU-intensive. Scaling a fleet of servers to handle heavy compression hits the budget hard.

We decided to move the entire compression pipeline to the client side. Here is how we engineered a browser-based PDF compressor that manipulates binary data without freezing the UI.

The Problem: PDFs are just containers
To compress a PDF effectively without ruining the text quality, you have to understand what makes them heavy. Usually, it's not the vector fonts or the text streams—it’s the embedded images.

A scanned document or a marketing deck is often just a container holding massive, unoptimized JPEGs or PNGs.

Our strategy was straightforward but technically difficult to implement in a browser:

Parse the PDF structure.

Iterate through the object catalog to find image streams.

Extract the raw image bytes.

Downsample and compress the images using the HTML5 Canvas API.

Re-inject the smaller images into the PDF structure.

Save the new blob.

The Code
For the PDF parsing and structure manipulation, we use pdf-lib. However, pdf-lib doesn't natively "compress" images—it just stores whatever bytes you embed. We had to write a custom routine to intercept the images and crunch them.

Here is the core logic for the image compression step. We use a Canvas to handle the resampling (changing dimensions) and the compression (quality reduction).

Note: In production, this must run inside a Web Worker. If you run this on the main thread, the browser will freeze while processing large documents. One catch: new Image() and document aren't available inside workers, so a worker-safe variant is sketched right after this function.

/**
 * Compresses an image buffer using the HTML5 Canvas API.
 *
 * @param {Uint8Array} imageBytes - The raw bytes of the image from the PDF.
 * @param {string} mimeType - 'image/jpeg' or 'image/png'.
 * @param {number} quality - 0.0 to 1.0 (e.g., 0.7 for 70% quality).
 * @param {number} scale - 0.0 to 1.0 (e.g., 0.5 to halve the resolution).
 * @returns {Promise<Uint8Array>} - The compressed image bytes.
 */
async function compressImage(imageBytes, mimeType, quality = 0.7, scale = 1.0) {
  return new Promise((resolve, reject) => {
    // Create an Image object (not attached to DOM)
    const img = new Image();

    // Create a Blob URL to load the data into the Image object
    const blob = new Blob([imageBytes], { type: mimeType });
    const url = URL.createObjectURL(blob);

    img.onload = () => {
      // Clean up memory
      URL.revokeObjectURL(url);

      // Calculate new dimensions (round, and never let them hit zero)
      const targetWidth = Math.max(1, Math.round(img.width * scale));
      const targetHeight = Math.max(1, Math.round(img.height * scale));

      // Create an OffscreenCanvas where supported, otherwise fall back
      // to a DOM canvas (main thread only -- document doesn't exist in Workers)
      let canvas;
      let ctx;

      if (typeof OffscreenCanvas !== 'undefined') {
        canvas = new OffscreenCanvas(targetWidth, targetHeight);
        ctx = canvas.getContext('2d');
      } else {
        canvas = document.createElement('canvas');
        canvas.width = targetWidth;
        canvas.height = targetHeight;
        ctx = canvas.getContext('2d');
      }

      // Draw and resize
      ctx.drawImage(img, 0, 0, targetWidth, targetHeight);

      // Export to blob with compression
      if (canvas.convertToBlob) {
        // OffscreenCanvas API
        canvas.convertToBlob({ type: 'image/jpeg', quality })
          .then(blob => blob.arrayBuffer())
          .then(buffer => resolve(new Uint8Array(buffer)))
          .catch(reject);
      } else {
        // Standard Canvas API (toBlob passes null if encoding fails)
        canvas.toBlob(
          (blob) => {
            if (!blob) return reject(new Error('Canvas toBlob returned null'));
            blob.arrayBuffer().then(buffer => resolve(new Uint8Array(buffer)), reject);
          },
          'image/jpeg',
          quality
        );
      }
    };

    img.onerror = (err) => reject(err);
    img.src = url;
  });
}
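
One caveat about the function above: new Image() and document.createElement don't exist inside a Web Worker, so as written it only runs on the main thread. For the worker path we decode with createImageBitmap instead, which is available in workers and skips the Blob URL dance entirely. Here is a minimal sketch of that variant, assuming OffscreenCanvas support (see the Safari note under Performance Considerations):

/**
 * Worker-safe variant: decodes via createImageBitmap instead of new Image().
 * Assumes OffscreenCanvas is available (Chrome, Firefox, Safari 16.4+).
 */
async function compressImageInWorker(imageBytes, mimeType, quality = 0.7, scale = 1.0) {
  const blob = new Blob([imageBytes], { type: mimeType });

  // createImageBitmap decodes without touching the DOM and works in Workers
  const bitmap = await createImageBitmap(blob);

  const targetWidth = Math.max(1, Math.round(bitmap.width * scale));
  const targetHeight = Math.max(1, Math.round(bitmap.height * scale));

  const canvas = new OffscreenCanvas(targetWidth, targetHeight);
  const ctx = canvas.getContext('2d');
  ctx.drawImage(bitmap, 0, 0, targetWidth, targetHeight);

  // Free the decoded pixels immediately -- these spikes crash mobile tabs
  bitmap.close();

  const outBlob = await canvas.convertToBlob({ type: 'image/jpeg', quality });
  return new Uint8Array(await outBlob.arrayBuffer());
}

Because an ImageBitmap holds fully decoded pixel data, calling close() as soon as the draw is done is what keeps memory flat on constrained devices.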

Integrating into the PDF
Once we have that helper function, we iterate through the PDF pages. This snippet demonstrates how we traverse the PDF, locate images, and swap them out.

import { PDFDocument } from 'pdf-lib';

async function compressPdf(pdfBytes) {
  // Load the PDF
  const pdfDoc = await PDFDocument.load(pdfBytes);

  // Get all pages
  const pages = pdfDoc.getPages();

  for (let i = 0; i < pages.length; i++) {
    const page = pages[i];

    // In a real implementation, you need to traverse the page's resources
    // to find XObject Images (a low-level sketch follows this snippet).
    // getImagesFromPage is a simplified abstraction, not a pdf-lib API:
    const { images } = getImagesFromPage(page);

    for (const imgNode of images) {
      // 1. Extract raw bytes
      const originalBytes = imgNode.data;

      // 2. Compress via our helper function
      // We convert everything to JPEG for better compression ratios
      const compressedBytes = await compressImage(originalBytes, 'image/jpeg', 0.6, 0.8);

      // 3. Embed the new image into the PDF document
      const newImage = await pdfDoc.embedJpg(compressedBytes);

      // 4. Replace the reference on the page (keeping dimensions the same
      //    as the visual layout). replaceWith is likewise a simplified abstraction.
      imgNode.replaceWith(newImage);
    }
  }

  // Serialize the PDF to bytes
  const savedBytes = await pdfDoc.save();
  return savedBytes;
}
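
Since getImagesFromPage and replaceWith are simplified abstractions (pdf-lib ships no built-in image-extraction helper), here is roughly what the real traversal looks like using pdf-lib's low-level object model (enumerateIndirectObjects, PDFRawStream). Treat this as a sketch under simplifying assumptions: it only touches plain JPEG streams and ignores soft masks, filter arrays, and inline images; compressImageInWorker is the worker-safe helper sketched earlier.

import { PDFName, PDFNumber, PDFRawStream } from 'pdf-lib';

// Walk every indirect object, find JPEG image XObjects, and swap their bytes.
async function recompressJpegs(pdfDoc) {
  for (const [ref, obj] of pdfDoc.context.enumerateIndirectObjects()) {
    // Image XObjects are raw streams with /Subtype /Image
    if (!(obj instanceof PDFRawStream)) continue;
    if (obj.dict.get(PDFName.of('Subtype')) !== PDFName.of('Image')) continue;

    // Only handle plain JPEGs (/Filter /DCTDecode); PNG-style FlateDecode
    // streams and filter arrays need dedicated handling
    if (obj.dict.get(PDFName.of('Filter')) !== PDFName.of('DCTDecode')) continue;

    const compressed = await compressImageInWorker(obj.contents, 'image/jpeg', 0.6, 0.8);

    // Keep the original if recompression didn't actually save anything
    if (compressed.length >= obj.contents.length) continue;

    // Decode once to learn the new pixel dimensions
    const bitmap = await createImageBitmap(new Blob([compressed], { type: 'image/jpeg' }));

    const dict = obj.dict.clone(pdfDoc.context);
    dict.set(PDFName.of('Width'), PDFNumber.of(bitmap.width));
    dict.set(PDFName.of('Height'), PDFNumber.of(bitmap.height));
    dict.set(PDFName.of('Length'), PDFNumber.of(compressed.length));
    dict.set(PDFName.of('ColorSpace'), PDFName.of('DeviceRGB')); // canvas exports RGB
    bitmap.close();

    // Swap the stream behind the same ref; page references stay valid
    pdfDoc.context.assign(ref, PDFRawStream.of(dict, compressed));
  }
}

Reassigning the stream behind its existing ref is what makes this safe: every page that pointed at the old image object now resolves to the new one, so no page-level bookkeeping is needed.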

Live Demo
We integrated this logic, wrapped in Web Workers and proper error handling, into our main platform. The interesting part is watching the progress bar move on nothing but the client's CPU.

You can test the compression algorithm here: https://nasajtools.com/tools/pdf/compress-pdf

Try uploading a heavy PDF (10MB+). You’ll notice there is zero network upload latency before processing starts.

Performance Considerations
While the code above works, moving to production required solving a few edge cases:

  1. The Main Thread Blocker
    Manipulating 50MB Uint8Arrays and rendering large canvases is heavy, so we run compression strictly inside Web Workers; the UI thread only handles the file drop and progress bar updates. If you run the compression on the main thread, the browser will flag the page as "unresponsive." (A wiring sketch follows this list.)

  2. Memory Leaks
    Browsers are aggressive about garbage collection, but Canvas elements and Blob URLs can cause memory spikes. We explicitly revoke Object URLs (URL.revokeObjectURL) and dereference image buffers immediately after processing to prevent the tab from crashing on mobile devices.

  3. OffscreenCanvas vs DOM Canvas
    We prefer OffscreenCanvas because it is available inside Web Workers. However, Safari's support for OffscreenCanvas in workers is relatively recent (since 16.4), so we maintain a fallback that posts messages back to the main thread to perform the rendering if the worker API isn't supported.
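
To make point 1 concrete, here is a minimal sketch of the hand-off between the UI thread and the worker. The detail that matters is the transfer list (the second argument to postMessage): listing the ArrayBuffer transfers ownership to the other side instead of copying 50MB across threads. The worker filename, message shape, and updateProgressBar are illustrative stand-ins, not our exact production code.

// main.js -- UI thread: file drop, progress bar, download. Nothing heavy.
const worker = new Worker('compress-worker.js'); // illustrative filename

async function compressFile(file) {
  const buffer = await file.arrayBuffer();

  return new Promise((resolve, reject) => {
    worker.onmessage = (e) => {
      if (e.data.type === 'progress') updateProgressBar(e.data.value); // hypothetical UI hook
      else if (e.data.type === 'done') resolve(new Blob([e.data.bytes], { type: 'application/pdf' }));
      else reject(new Error(e.data.error));
    };
    // The transfer list moves `buffer` instead of copying it;
    // it becomes unusable (detached) on this side afterwards
    worker.postMessage({ type: 'compress', buffer }, [buffer]);
  });
}

// compress-worker.js -- all the heavy lifting happens here
self.onmessage = async (e) => {
  try {
    const result = await compressPdf(new Uint8Array(e.data.buffer));
    // Transfer the result back the same way
    self.postMessage({ type: 'done', bytes: result }, [result.buffer]);
  } catch (err) {
    self.postMessage({ type: 'error', error: String(err) });
  }
};

The same message channel doubles as the fallback from point 3: when the worker detects no OffscreenCanvas support, it posts the raw image bytes back and lets the main thread do the canvas work.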

Summary
Client-side PDF manipulation is more complex than server-side because you are limited by the user's hardware. However, the trade-off is worth it: zero server costs for file processing and a massive trust signal for users who know their data never leaves their device.

Hopefully, this helps you understand how to manipulate binary file types in the browser!
