monkeymore studio

Posted on Apr 9

Building a Browser-Based PDF Page Removal Tool with WebAssembly and Web Workers

#webdev #javascript #privacy #tutorial

In this article, we'll explore how to implement a purely client-side PDF page removal tool that runs entirely in the browser. There is no backend processing and no file upload, so documents stay on the user's device.

Why Browser-Based PDF Processing?

Traditional PDF processing typically requires:

Uploading files to a server
Processing on the backend
Downloading the result

This approach has significant drawbacks:

Privacy concerns - Your sensitive documents are sent to third-party servers
Network dependency - Requires stable internet connection
Latency - Upload and download times for large files
Server costs - Backend infrastructure required

Browser-based processing addresses each of these issues:

✅ Files never leave your computer
✅ Works offline once the page and the WASM binary have been loaded and cached
✅ No upload or download wait; processing starts immediately
✅ Zero server costs for PDF operations

Architecture Overview

Our solution combines three powerful web technologies:

WebAssembly (WASM) - Running QPDF (a powerful PDF manipulation library) compiled to WASM
Web Workers - Offloading heavy PDF operations to a background thread
Comlink - Making worker communication as simple as async function calls

Core Data Structures

PageRange Type

The fundamental data structure for specifying which pages to remove:

// types/pdfdata.ts
export type PageRange = [number, number];

Each PageRange is a tuple where:

Index 0: Start page number (inclusive)
Index 1: End page number (inclusive)
Single pages are represented as [n, n]
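
As a small illustration of these rules, a range can be checked before it is sent to the worker. This is a minimal sketch; the helper name isValidPageRange is hypothetical and not part of the original code:

```typescript
type PageRange = [number, number];

// Hypothetical helper: true when the tuple is a usable 1-based, inclusive range.
function isValidPageRange([start, end]: PageRange): boolean {
  return (
    Number.isInteger(start) &&
    Number.isInteger(end) &&
    start >= 1 &&
    start <= end
  );
}
```

A check like this rejects reversed ranges such as [3, 1] and tuples containing NaN, which parseInt produces for non-numeric input.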

WorkerFunctions Interface

The contract between main thread and worker:

// hooks/useqpdf.ts
interface WorkerFunctions {
  init: () => Promise<void>;
  remove: (file: File, ...range: PageRange[]) => Promise<ArrayBuffer | null>;
  // ... other operations
}

Implementation Deep Dive

1. User Interface Layer

The UI component handles user input and triggers the removal process:

// app/[locale]/_components/qpdf/remove.tsx
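const LAST_PAGE = 10000; // sentinel for "z", matching the worker's upper bound

// "z" means the last page; parseInt("z") would otherwise yield NaN
const toPage = (s: string) =>
  s.trim().toLowerCase() === "z" ? LAST_PAGE : parseInt(s, 10);

export const Organize = () => {
  const [files, setFiles] = useState<File[]>([]);
  const { value: pages, onChange: onChangePages } =
    useInputValue<string>("1-z");
  const { remove } = useQpdf();

  const removeInMain = async () => {
    const file = files[0];
    if (!file) return; // nothing selected yet

    // Parse user input: "1-3,5,10-z" → [[1,3], [5,5], [10,10000]]
    const removePages = pages
      .replaceAll("，", ",")  // Support Chinese comma
      .split(",")
      .map((e) => {
        if (e.includes("-")) {
          const t = e.split("-");
          return [toPage(t[0]!), toPage(t[1]!)] as PageRange;
        } else {
          return [toPage(e), toPage(e)] as PageRange;
        }
      });

    // Skip the call if any token was not a page number or "z"
    if (removePages.some(([s, e]) => Number.isNaN(s) || Number.isNaN(e))) return;

    const outputFile = await remove(file, ...removePages);

    if (outputFile) {
      autoDownloadBlob(
        new Blob([outputFile], { type: "application/pdf" }),
        "organize.pdf"
      );
    }
  };
  // ...
};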

Key features of the input format:

Comma-separated ranges: 1-3,5,10-z
Single pages: 5 becomes [5, 5]
Ranges: 1-3 becomes [1, 3]
Special character z represents the last page
Supports Chinese comma ， for localization
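
The input format above can also be parsed by a standalone function that is easy to unit test. The following is a minimal sketch rather than the tool's actual parser: the names parsePageRanges and parsePageToken are hypothetical, and the sketch assumes 10000 is used as the stand-in for the last page, as in the rest of this article.

```typescript
type PageRange = [number, number];

// Assumption: 10000 stands in for "last page" (the sentinel used in this article).
const LAST_PAGE = 10000;

// Converts one token ("7" or "z") to a page number, rejecting anything else.
function parsePageToken(token: string): number {
  const t = token.trim().toLowerCase();
  if (t === "z") return LAST_PAGE;
  if (!/^\d+$/.test(t)) throw new Error(`Invalid page number: "${token}"`);
  const n = Number(t);
  if (n < 1) throw new Error(`Page numbers start at 1: "${token}"`);
  return n;
}

// Parses "1-3,5,10-z" into inclusive [start, end] tuples.
function parsePageRanges(input: string): PageRange[] {
  return input
    .replace(/，/g, ",") // accept the full-width comma as a separator
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part): PageRange => {
      const pieces = part.split("-");
      if (pieces.length > 2) throw new Error(`Invalid range: "${part}"`);
      const start = parsePageToken(pieces[0] ?? "");
      const end = pieces.length === 2 ? parsePageToken(pieces[1] ?? "") : start;
      if (start > end) throw new Error(`Range is reversed: "${part}"`);
      return [start, end];
    });
}
```

Throwing on malformed input, instead of letting NaN flow through, lets the UI show a validation message before any work is sent to the worker.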

2. Worker Management with Comlink

The useQpdf hook manages the Web Worker lifecycle:

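// hooks/useqpdf.ts
import { useEffect, useRef } from "react";
import * as Comlink from "comlink";
// PdfWorker is the project's worker constructor (its import is not shown here)

export const useQpdf = () => {
  const workerRef = useRef<Comlink.Remote<WorkerFunctions> | null>(null);

  useEffect(() => {
    const worker = new PdfWorker();

    worker.onerror = (error) => {
      console.error("Worker error:", error);
    };

    const api = Comlink.wrap<WorkerFunctions>(worker);
    workerRef.current = api;
    api.init().catch((error) => {
      console.error("QPDF initialization failed:", error);
    });

    // The cleanup must be returned by the effect itself; a function
    // returned from an async helper is never seen by React.
    return () => {
      workerRef.current = null;
      worker.terminate();
    };
  }, []);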

  const remove = async (
    file: File,
    ...range: PageRange[]
  ): Promise<ArrayBuffer | null> => {
    if (!workerRef.current) return null;
    const r = await workerRef.current.remove(file, ...range);
    return r;
  };

  return { remove };
};

Why Comlink?

Eliminates manual postMessage boilerplate
Provides type-safe function calls
Handles serialization automatically
Makes worker code look like regular async functions

3. The Range Inversion Algorithm

QPDF's --pages flag specifies which pages to keep, not which to remove. So we need to invert the user's "remove" ranges into "keep" ranges:

// hooks/pdf.worker.js
function removeRanges(mainRange, ...excludeRanges) {
  const [start, end] = mainRange;
  const excludeSet = new Set();

  // Collect all pages to exclude
  excludeRanges.forEach(([s, e]) => {
    for (let i = s; i <= e; i++) {
      excludeSet.add(i);
    }
  });

  // Collect remaining pages
  const remaining = [];
  for (let i = start; i <= end; i++) {
    if (!excludeSet.has(i)) {
      remaining.push(i);
    }
  }

  // Convert consecutive numbers to compact ranges
  const result = [];
  if (remaining.length === 0) return result;

  let currentStart = remaining[0];
  let currentEnd = remaining[0];

  for (let i = 1; i < remaining.length; i++) {
    if (remaining[i] === currentEnd + 1) {
      currentEnd = remaining[i];
    } else {
      result.push(
        currentStart === currentEnd
          ? [currentStart]
          : [currentStart, currentEnd]
      );
      currentStart = remaining[i];
      currentEnd = remaining[i];
    }
  }

  result.push(
    currentStart === currentEnd ? [currentStart] : [currentStart, currentEnd]
  );

  return result;
}

Example transformation:

Input: remove [1,3], [5,5], [10,10000] from the main range [1,10000] (10000 stands in for the last page)
Process: exclude pages 1-3, 5, and 10-10000 → remaining: 4, 6-9
Output: [[4], [6,9]] → formatted as "4,6-9"
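
The same inversion can be computed without enumerating every page, by sorting the removal ranges and sweeping across them once. The sketch below is an alternative to removeRanges, not the code the tool uses: the name invertRanges is hypothetical, and it returns single pages as [n, n] rather than [n].

```typescript
type PageRange = [number, number];

// Returns the ranges in 1..lastPage that are NOT covered by `remove`.
function invertRanges(remove: PageRange[], lastPage: number): PageRange[] {
  const sorted = [...remove].sort((a, b) => a[0] - b[0]);
  const keep: PageRange[] = [];
  let next = 1; // first page not yet known to be removed

  for (const [start, end] of sorted) {
    // Pages between `next` and the start of this removal range are kept
    if (start > next) keep.push([next, Math.min(start - 1, lastPage)]);
    next = Math.max(next, end + 1);
    if (next > lastPage) break; // everything after this point is removed
  }

  // Keep the tail after the last removal range, if any
  if (next <= lastPage) keep.push([next, lastPage]);
  return keep;
}
```

For the example above, invertRanges([[1,3],[5,5],[10,10000]], 10000) yields [[4,4],[6,9]]. The work is proportional to the number of ranges rather than the number of pages, and overlapping or unsorted input ranges are handled by the sort and the Math.max step.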

4. QPDF WASM Execution

The core PDF processing happens in the Web Worker using QPDF compiled to WebAssembly:

// hooks/pdf.worker.js
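async remove(file, ...range) {
  const MAX_PAGE = 10000; // stand-in for the last page; the real count is not known here

  // Calculate pages to KEEP (inverse of pages to remove)
  const result = removeRanges([1, MAX_PAGE], ...range);
  if (result.length === 0) return null; // every page was removed; nothing to write

  // Only a range that actually reaches MAX_PAGE extends to the last page ('z').
  // Overwriting the end of the last range unconditionally would turn "4,6-9"
  // into "4,6-z" and keep pages the user asked to remove.
  const fmt = (p) => (p === MAX_PAGE ? "z" : String(p));
  const resultstr = result.map((e) =>
    e.length === 1 ? fmt(e[0]) : fmt(e[0]) + "-" + fmt(e[1])
  );

  // Convert File to Uint8Array and write it to QPDF's virtual filesystem
  const uint8Array = new Uint8Array(await file.arrayBuffer());
  qpdf.FS.writeFile("/input.pdf", uint8Array);

  // Build QPDF command
  const params = [
    "/input.pdf",
    "--pages",
    "/input.pdf",
    resultstr.join(","),  // Pages to KEEP
    "--",
    "/output.pdf",
  ];

  // Execute QPDF
  qpdf.callMain(params);

  // Read output (a Uint8Array) from the virtual filesystem
  const outputFile = qpdf.FS.readFile("/output.pdf");

  // Remove temporary files so a later call cannot read stale output
  qpdf.FS.unlink("/input.pdf");
  qpdf.FS.unlink("/output.pdf");
  return outputFile;
}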

QPDF Command Example:

# To remove pages 1-3 and 5 from a 100-page document:
# We need to keep pages 4, 6-100
qpdf input.pdf --pages input.pdf 4,6-z -- output.pdf
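
To make the mapping from keep ranges to this command line concrete, here is a minimal sketch of an argument builder. The function names formatKeepRanges and buildRemoveArgs are hypothetical, and the sketch assumes 10000 is the sentinel for the last page, as elsewhere in this article.

```typescript
type PageRange = [number, number];

// Assumption: 10000 stands in for "last page" and is emitted as qpdf's "z".
const LAST_PAGE = 10000;

// Formats keep ranges as a qpdf page specification, e.g. "4,6-z".
function formatKeepRanges(keep: PageRange[], lastPage: number = LAST_PAGE): string {
  const fmt = (p: number): string => (p >= lastPage ? "z" : String(p));
  return keep
    .map(([start, end]) => (start === end ? fmt(start) : `${fmt(start)}-${fmt(end)}`))
    .join(",");
}

// Builds the argument list for: qpdf <input> --pages <input> <spec> -- <output>
function buildRemoveArgs(
  keep: PageRange[],
  input: string = "/input.pdf",
  output: string = "/output.pdf",
): string[] {
  if (keep.length === 0) throw new Error("No pages left to keep");
  return [input, "--pages", input, formatKeepRanges(keep), "--", output];
}
```

With keep ranges [[4,4],[6,10000]], the builder produces the same arguments as the command shown above. Rejecting an empty keep list avoids passing an empty page specification to QPDF.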

5. WASM Initialization

The QPDF WASM module is initialized with Emscripten's virtual filesystem:

// lib/qpdfwasm.js
import createModule from "@neslinesli93/qpdf-wasm";

const f = async () => {
  const qpdf = await createModule({
    locateFile: () => "/qpdf.wasm",
    noInitialRun: true,  // Don't run main() immediately
    preRun: [(module) => {
      if (module.FS) {
        // Filesystem is ready
      }
    }],
  });
  return qpdf;
};

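Complete Processing Flow

The user selects a PDF and enters the pages to remove (for example 1-3,5,10-z)
The UI parses the input into PageRange tuples and calls remove() from useQpdf
Comlink forwards the call to the Web Worker
The worker writes the file to the virtual filesystem and inverts the ranges into pages to keep
QPDF runs with --pages and writes /output.pdf
The worker returns the output bytes, and the main thread triggers the download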

Key Technical Decisions

1. Why QPDF?

QPDF is a powerful command-line tool for PDF manipulation. By compiling it to WASM:

We get battle-tested PDF processing logic
Supports complex operations (merge, split, rotate, encrypt)
Handles edge cases and malformed PDFs well

2. Why Web Workers?

PDF processing can be CPU-intensive:

Parsing large PDFs
Rebuilding document structure
Writing output files

Running in a Web Worker:

Prevents UI freezing
Keeps the main thread free for rendering and input handling
Runs in parallel with the main thread on multi-core systems

3. Virtual File System

Emscripten provides an in-memory filesystem:

No actual disk access needed
Fast read/write operations
Files live in memory until they are deleted or the worker terminates
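
Because files in the virtual filesystem persist for as long as the worker is alive, a long-lived worker may want to delete them after each operation. The sketch below shows one way to do that; runWithTempFiles and the MinimalFS interface are hypothetical names, and MinimalFS covers only the three Emscripten FS calls used here (writeFile, readFile, unlink).

```typescript
// Minimal subset of Emscripten's FS API that this sketch relies on.
interface MinimalFS {
  writeFile(path: string, data: Uint8Array): void;
  readFile(path: string): Uint8Array;
  unlink(path: string): void;
}

// Writes the input, runs the operation, reads the output, and always
// removes both temporary files, even if the operation throws.
function runWithTempFiles(
  fs: MinimalFS,
  input: Uint8Array,
  run: (inPath: string, outPath: string) => void,
): Uint8Array {
  const inPath = "/input.pdf";
  const outPath = "/output.pdf";
  fs.writeFile(inPath, input);
  try {
    run(inPath, outPath);
    return fs.readFile(outPath);
  } finally {
    for (const path of [inPath, outPath]) {
      try {
        fs.unlink(path);
      } catch {
        // The file was never created (for example, the operation failed early)
      }
    }
  }
}
```

In the worker, qpdf.FS would be passed as fs and the run callback would call qpdf.callMain with the two paths. The finally block also prevents a failed run from leaving a stale /output.pdf that a later call could read by mistake.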

File Download Utility

After processing, we trigger the browser download:

// utils/pdf.ts
export function autoDownloadBlob(blob: Blob, filename: string) {
  const blobUrl = URL.createObjectURL(blob);
  const downloadLink = document.createElement("a");
  downloadLink.href = blobUrl;
  downloadLink.download = filename;
  downloadLink.style.display = "none";

  document.body.appendChild(downloadLink);
  downloadLink.click();
  document.body.removeChild(downloadLink);
  // Revoke in a later task so the browser has started the download first
  setTimeout(() => URL.revokeObjectURL(blobUrl), 0);
}

Benefits of This Architecture

Privacy First: Files never leave the browser
Performance: Near-native speed with WASM
Responsive UI: Web Workers prevent blocking
Type Safety: TypeScript + Comlink = type-safe worker communication
Maintainability: Clean separation of concerns

Try It Yourself

Want to remove pages from your PDF without uploading anything to a server? Try our free browser-based PDF tool:

Remove PDF Pages Online →

All processing happens locally in your browser - your files never leave your computer!

Conclusion

Building a browser-based PDF processing tool demonstrates the power of modern web technologies. By combining WebAssembly, Web Workers, and Comlink, we can perform complex PDF operations entirely client-side while maintaining a responsive user interface.

This approach is ideal for:

Privacy-sensitive documents
Offline-capable applications
Reducing server costs
Improving user experience with instant processing

The patterns shown here provide a starting point for integrating WASM into React applications.

DEV Community