Introduction
Extracting images from PDF documents is a common requirement in many applications. Traditionally, this task required server-side processing, where users had to upload their PDF files to a server, wait for processing, and then download the extracted images. This approach has several drawbacks: privacy concerns, network latency, and dependency on server availability.
In this article, we'll explore how we built a pure client-side solution that runs entirely in the browser, enabling users to extract images from PDFs without ever uploading their files to a server. This implementation leverages the power of modern web technologies including PDF.js, HTML5 Canvas, and WebAssembly.
Why Browser-Based Processing?
Before diving into the technical implementation, let's understand why processing PDFs in the browser is advantageous:
1. Privacy & Security
Users' PDF files never leave their device. This is crucial for sensitive documents containing personal, financial, or confidential information.
2. Zero Server Costs
Since all processing happens on the client side, there are no server infrastructure costs for PDF processing operations.
3. Instant Response
No network latency means faster processing. Users don't need to wait for file uploads and downloads.
4. Offline Capability
Once the application has loaded, it keeps working without an internet connection.
5. No Upload Size Limits
Users can process large PDF files. The practical limit is the device's memory rather than a server-imposed upload cap.
Architecture Overview
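Our implementation follows a modular architecture with a clear separation of concerns: a React component for the UI, a hook that manages PDF.js, an extraction module, and small utilities for packaging and download.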
Core Components
1. The Main Component: extractimages.tsx
The entry point is a simple React component that orchestrates the image extraction workflow:
"use client";
import { useState } from "react";
import { usePdfjs } from "@/hooks/usepdfjs";
import { useTranslations } from "next-intl";
import { autoDownloadBlob } from "@/utils/pdf";
import { PdfPage } from "@/app/[locale]/_components/pdfpage";

export const Organize = () => {
  const [files, setFiles] = useState<File[]>([]);
  const { extractImages } = usePdfjs();
  const t = useTranslations("ExtractImage");

  // Extract the images from the first selected PDF and download them as a ZIP
  const mergeInMain = async () => {
    console.log("mergeInMain");
    files.forEach((e) => console.log(e.name));
    const file = files[0];
    if (!file) return; // Nothing selected yet
    const outputFile = await extractImages(file);
    if (outputFile) {
      autoDownloadBlob(new Blob([outputFile]), "images.zip");
    }
  };

  // Called whenever the selected files (or their order) change
  const onPdfFiles = (files: File[]) => {
    console.log("File count or order changed");
    files.forEach((e) => console.log(e.name));
    setFiles(files);
  };

  return (
    <PdfPage
      title={t("title")}
      onFiles={onPdfFiles}
      desp={t("desp")}
      process={mergeInMain}
    >
      <div></div>
    </PdfPage>
  );
};
This component is elegantly simple because all the heavy lifting is delegated to the custom usePdfjs hook.
2. The PDF Processing Hook: usepdfjs.ts
This is the heart of our implementation. The hook manages the PDF.js library lifecycle and provides methods for various PDF operations:
import type { getDocument, GlobalWorkerOptions } from "pdfjs-dist";
import { useEffect, useRef, useState } from "react";
import { zipImageBitmaps, extractImagesFromPdf } from "@/lib/parsePdfImage";
import JSZip from "jszip";

// Type definition for pdfjsLib on window
type PdfjsLibType = {
  getDocument: typeof getDocument;
  GlobalWorkerOptions: typeof GlobalWorkerOptions;
};

export const usePdfjs = () => {
  const pdfjsRef = useRef<PdfjsLibType | null>(null);
  const loading = useRef(false);
  const [loaded, setLoaded] = useState(false);

  const extractImages = async (file: File): Promise<ArrayBuffer> => {
    // Fail loudly if called before the PDF.js script has finished loading
    if (!pdfjsRef.current) {
      throw new Error("PDF.js has not finished loading");
    }
    const images = await extractImagesFromPdf(pdfjsRef.current, file);
    const buffer = await zipImageBitmaps(images);
    return buffer;
  };

  useEffect(() => {
    // Polyfill globalThis for compatibility
    if (typeof globalThis === "undefined") {
      window.globalThis = window;
    }
    if (loading.current === true) return;
    loading.current = true;

    // Dynamically load PDF.js as ES module
    const script = document.createElement("script");
    script.src = "/pdf/pdf.min.mjs";
    script.type = "module";
    script.async = true;
    script.onload = () => {
      console.log("pdfjs-dist loaded");
      const typedPdfjs = window.pdfjsLib as PdfjsLibType;
      typedPdfjs.GlobalWorkerOptions.workerSrc = "/pdf/pdf.worker.min.mjs";
      pdfjsRef.current = typedPdfjs;
      loading.current = false;
      setLoaded(true);
    };
    script.onerror = (e) => {
      console.error("Failed to load pdfjs-dist:", e);
      loading.current = false; // Allow a later retry
    };
    document.head.appendChild(script);

    return () => {
      document.head.removeChild(script);
      pdfjsRef.current = null;
      loading.current = false; // Let a re-mounted effect load the script again
    };
  }, []);

  return { loaded, extractImages };
};
Key Design Decisions:
- Dynamic Script Loading: PDF.js is loaded dynamically as an ES module to avoid issues with Next.js and .mjs files during build time.
- Worker Configuration: We configure the worker source so that PDF.js can offload heavy parsing work to a Web Worker, preventing UI blocking:
typedPdfjs.GlobalWorkerOptions.workerSrc = "/pdf/pdf.worker.min.mjs";
- Reference Management: Using useRef ensures we maintain a single instance of the PDF.js library across re-renders.
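The hook's loading ref guards against injecting the script twice from one component instance. To illustrate the same "load once" idea independently of React, here is a minimal sketch of a memoised async loader. The names once and loadPdfjs are hypothetical, and the stub loader stands in for the real script injection; this is not code from the project.

```typescript
// Wraps an async loader so repeated and concurrent calls share one promise.
function once<T>(loader: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => {
    if (pending === null) {
      pending = loader().catch((err) => {
        pending = null; // A failed load may be retried by the next caller
        throw err;
      });
    }
    return pending;
  };
}

// Stand-in loader (assumption): the real one would inject /pdf/pdf.min.mjs.
let loadCalls = 0;
const loadPdfjs = once(async () => {
  loadCalls += 1;
  return { name: "pdfjs-stub" };
});

// Both calls receive the same promise, and the loader body runs only once.
const firstLoad = loadPdfjs();
const secondLoad = loadPdfjs();
```

A module-level loader like this would let several components share one PDF.js instance, at the cost of keeping the library alive after the components unmount.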
3. The Image Extraction Engine: parsePdfImage.js
This is where the magic happens. We dive deep into PDF.js internals to extract embedded images:
import JSZip from "jszip";

// Matrix multiplication for coordinate transformations.
// Matrices are 6-element arrays [a, b, c, d, e, f].
function multiplyMatrices(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Apply transformation matrix to a point
function applyTransform(p, m) {
  const xt = p[0] * m[0] + p[1] * m[2] + m[4];
  const yt = p[0] * m[1] + p[1] * m[3] + m[5];
  return [xt, yt];
}
Understanding PDF Coordinate Systems
PDF uses a complex coordinate system where:
- The origin (0,0) is at the bottom-left corner
- Y-axis increases upward
- Images can be rotated, scaled, and skewed using transformation matrices
Our code tracks these transformations so that it knows where each image is placed on the page and how it is oriented.
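As a worked example of these rules: PDF paints every image into the unit square (0,0)-(1,1), and the current transformation matrix maps that square onto the page. The sketch below transforms the four corners and flips the result into a top-left-origin box. The function imageBoxInViewer and the Box type are illustrative names, and the sketch assumes an unrotated page whose origin is at (0,0).

```typescript
type Matrix = [number, number, number, number, number, number];
type Point = [number, number];

interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Same formula as applyTransform in the article.
function applyTransform(p: Point, m: Matrix): Point {
  return [p[0] * m[0] + p[1] * m[2] + m[4], p[0] * m[1] + p[1] * m[3] + m[5]];
}

// Maps the image's unit square through the CTM, then converts the
// bounding box from PDF space (y up) to viewer space (y down).
function imageBoxInViewer(ctm: Matrix, pageHeight: number): Box {
  const corners: Point[] = [[0, 0], [1, 0], [0, 1], [1, 1]];
  const pts = corners.map((c) => applyTransform(c, ctm));
  const xs = pts.map((p) => p[0]);
  const ys = pts.map((p) => p[1]);
  const minX = Math.min(...xs);
  const maxX = Math.max(...xs);
  const minY = Math.min(...ys);
  const maxY = Math.max(...ys);
  return { x: minX, y: pageHeight - maxY, width: maxX - minX, height: maxY - minY };
}

// A 200x100 image placed at (50, 600) on a 792-unit-tall page.
const exampleBox = imageBoxInViewer([200, 0, 0, 100, 50, 600], 792);
// exampleBox is { x: 50, y: 92, width: 200, height: 100 }
```

Using all four corners rather than only the translation components keeps the box correct when the image is rotated or skewed.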
The Extraction Algorithm
export async function extractImagesFromPdf(pdfjsLib, file) {
  const arrayBuffer = await file.arrayBuffer();
  const pdf = await pdfjsLib.getDocument({
    data: new Uint8Array(arrayBuffer),
    cMapPacked: false, // Treat CMaps (used for CJK fonts) as not binary-packed
  }).promise;
  const numPages = pdf.numPages;
  console.log("Total pages:", numPages);
  const images = [];

  // Iterate through all pages (note: page numbers start from 1)
  for (let pageNum = 1; pageNum <= numPages; pageNum++) {
    const page = await pdf.getPage(pageNum);
    const stateStack = [];
    const pageDimensions = await getPageDimensions(page);
    console.log("Page layout:", JSON.stringify(pageDimensions));
    const viewport = page.getViewport({ scale: 1 });
    let currentTransform = [1, 0, 0, 1, 0, 0]; // Identity matrix
    const opList = await page.getOperatorList();
    const { fnArray, argsArray } = opList;

    // Iterate through all PDF operators
    for (let i = 0; i < fnArray.length; i++) {
      // Handle coordinate transformations
      if (fnArray[i] === pdfjsLib.OPS.transform) {
        /*
          Transformation matrix components:
          [0] a - Scale X (with rotation)
          [1] b - Skew Y and Rotation
          [2] c - Skew X and Rotation
          [3] d - Scale Y (with rotation)
          [4] e - Translate X
          [5] f - Translate Y
          x' = a*x + c*y + e
          y' = b*x + d*y + f
        */
        currentTransform = multiplyMatrices(currentTransform, argsArray[i]);
        console.log("---", JSON.stringify(argsArray[i]));
      } else if (fnArray[i] === pdfjsLib.OPS.save) {
        // Push current state to stack
        console.log("---save");
        stateStack.push(currentTransform);
      } else if (fnArray[i] === pdfjsLib.OPS.restore) {
        // Pop state from stack; fall back to identity on an unbalanced restore
        console.log("---restore");
        currentTransform = stateStack.pop() ?? [1, 0, 0, 1, 0, 0];
      }

      // Detect image painting operations
      if (
        fnArray[i] === pdfjsLib.OPS.paintJpegXObject ||
        fnArray[i] === pdfjsLib.OPS.paintImageXObject ||
        fnArray[i] === pdfjsLib.OPS.paintXObject ||
        fnArray[i] === pdfjsLib.OPS.paintImageMaskXObject
      ) {
        console.log("fnArray", fnArray[i]);
        const imageArgs = argsArray[i];
        const image = imageArgs[0]; // Object reference or name
        const xObjectDict = page.objs?.get(image); // Get image resource

        // Extract translation parameters
        const x = currentTransform[4]; // Horizontal position
        const y = currentTransform[5]; // Vertical position
        // Convert to viewer coordinates (origin at top-left);
        // xObjectDict may be missing, so read its height defensively
        const viewerY = viewport.height - (y + (xObjectDict?.height || 0));
        console.log(
          `Image position: X=${x}, Y=${y} (PDF coordinates) ${currentTransform}`
        );
        console.log(`Image position: X=${x}, Y=${viewerY} (viewer coordinates)`);

        // Store extracted image data
        if (xObjectDict) {
          images.push({
            page: pageNum,
            name: image,
            width: xObjectDict.width,
            height: xObjectDict.height,
            data: xObjectDict.bitmap,
            format: "png",
          });
        }
        // Handle pre-encoded image data (JPEG, PNG)
        if (image && image.imageData) {
          images.push({
            page: pageNum,
            width: image.width,
            height: image.height,
            format: image.colorSpace === "DeviceRGB" ? "jpg" : "png",
            data: image.imageData,
          });
        }
      }
    }
  }
  console.log(`Found ${images.length} images`);
  return images;
}
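To see the save/restore bookkeeping in isolation, here is a small sketch that walks a simplified operator list and records the matrix in effect at each image. The Op type is a hypothetical stand-in for PDF.js's fnArray/argsArray pairs, not its real encoding, and collectImageTransforms is an illustrative name.

```typescript
type Matrix = [number, number, number, number, number, number];

// Simplified operators (assumption): real PDF.js uses numeric OPS codes.
type Op =
  | { kind: "save" }
  | { kind: "restore" }
  | { kind: "transform"; matrix: Matrix }
  | { kind: "paintImage"; name: string };

const IDENTITY: Matrix = [1, 0, 0, 1, 0, 0];

// Same arithmetic as multiplyMatrices in the article.
function multiply(m1: Matrix, m2: Matrix): Matrix {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

function collectImageTransforms(ops: Op[]): { name: string; ctm: Matrix }[] {
  const stack: Matrix[] = [];
  let ctm: Matrix = IDENTITY;
  const found: { name: string; ctm: Matrix }[] = [];
  for (const op of ops) {
    switch (op.kind) {
      case "save":
        stack.push(ctm);
        break;
      case "restore":
        ctm = stack.pop() ?? IDENTITY; // Tolerate an unbalanced restore
        break;
      case "transform":
        ctm = multiply(ctm, op.matrix);
        break;
      case "paintImage":
        found.push({ name: op.name, ctm });
        break;
    }
  }
  return found;
}

// The second image is unaffected by the first image's scaling because
// restore brings back the matrix that was saved before it.
const placements = collectImageTransforms([
  { kind: "save" },
  { kind: "transform", matrix: [2, 0, 0, 2, 10, 20] },
  { kind: "paintImage", name: "img1" },
  { kind: "restore" },
  { kind: "transform", matrix: [1, 0, 0, 1, 5, 5] },
  { kind: "paintImage", name: "img2" },
]);
```

Because multiply always returns a new array, pushing the current matrix onto the stack without copying it is safe.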
The Algorithm Flow
The function loads the PDF, visits each page, fetches the page's operator list, tracks the transformation matrix across transform, save, and restore operators, and collects image data whenever it meets an image-painting operator.
Key Technical Details:
- Operator List Parsing: PDF.js converts each page into an "operator list", a sequence of drawing commands. We iterate through these to find image-related operations.
- State Management: PDF uses a graphics state stack. We must handle save and restore operations properly to maintain correct transformation matrices.
- Image Types: We handle several image-painting operators:
  - paintJpegXObject: JPEG images
  - paintImageXObject: Generic images (PNG, etc.)
  - paintXObject: Form XObjects (may contain images)
  - paintImageMaskXObject: Image masks
- Bitmap Extraction: Raw bitmap data is accessed via xObjectDict.bitmap, an ImageBitmap object that can be rendered directly to a canvas.
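The extraction code guesses the format of pre-encoded data from the color space, but a color space does not determine how the bytes are encoded. A more direct check is to look at the file signature. The sketch below assumes the data is available as a Uint8Array of encoded bytes, and sniffImageFormat is a hypothetical helper, not part of PDF.js.

```typescript
type SniffedFormat = "jpg" | "png" | "unknown";

// Identifies JPEG and PNG data from their leading signature bytes.
function sniffImageFormat(bytes: Uint8Array): SniffedFormat {
  // JPEG streams start with the SOI marker FF D8, followed by another FF
  if (bytes.length >= 3 && bytes[0] === 0xff && bytes[1] === 0xd8 && bytes[2] === 0xff) {
    return "jpg";
  }
  // PNG files start with the fixed 8-byte signature 89 50 4E 47 0D 0A 1A 0A
  const pngSignature = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
  if (bytes.length >= pngSignature.length && pngSignature.every((b, i) => bytes[i] === b)) {
    return "png";
  }
  return "unknown";
}

const jpegHeader = new Uint8Array([0xff, 0xd8, 0xff, 0xe0, 0x00, 0x10]);
const detected = sniffImageFormat(jpegHeader); // "jpg"
```

Data that matches neither signature is better treated as raw pixels and re-encoded through the canvas path than saved with a guessed extension.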
4. Image Conversion and Packaging
Once we have the raw image data, we need to convert it to a standard format (PNG) and package it for download:
export async function imageBitmapToPngBlob(data, width, height) {
  // Create a detached canvas (never added to the DOM)
  const canvas = document.createElement("canvas");
  canvas.width = width;
  canvas.height = height;
  const ctx = canvas.getContext("2d");
  if (!ctx) {
    throw new Error("2D canvas context is not available");
  }
  // Draw ImageBitmap to canvas
  ctx.drawImage(data, 0, 0);
  // Convert to PNG Blob
  return new Promise((resolve, reject) => {
    canvas.toBlob((blob) => {
      if (!blob) {
        reject(new Error("Failed to encode canvas as PNG"));
        return; // Do not fall through to resolve()
      }
      resolve(blob);
    }, "image/png");
  });
}

export async function zipImageBitmaps(data) {
  const zip = new JSZip();
  // Process each extracted image
  for (let i = 0; i < data.length; i++) {
    const bitmap = data[i];
    // Convert ImageBitmap to PNG Blob
    const pngBlob = await imageBitmapToPngBlob(
      bitmap.data,
      bitmap.width,
      bitmap.height
    );
    console.log("image blob size", pngBlob.size);
    // Add to ZIP under the original resource name, with a fallback
    // for entries that have no name, plus a .png extension
    zip.file(`${bitmap.name ?? `image-${i}`}.png`, pngBlob);
  }
  // Generate ZIP file with compression
  const zipBuffer = await zip.generateAsync({
    type: "arraybuffer",
    compression: "DEFLATE",
    compressionOptions: { level: 6 },
  });
  return zipBuffer;
}
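One thing to watch in the packaging step is that the same image resource can be painted more than once, and two entries with the same name overwrite each other in a ZIP archive. The sketch below shows one way to derive unique entry names. The helper uniqueEntryNames and the page-prefixed naming scheme are assumptions for illustration, not the project's code.

```typescript
interface NamedImage {
  page: number;
  name?: string;
}

// Builds one unique file name per image, adding a numeric suffix on collisions.
function uniqueEntryNames(images: NamedImage[], ext = "png"): string[] {
  const used = new Set<string>();
  return images.map((img, index) => {
    const base = `page${img.page}-${img.name ?? `image${index}`}`;
    let candidate = `${base}.${ext}`;
    // Keep trying suffixes until the name has not been handed out yet
    for (let n = 1; used.has(candidate); n++) {
      candidate = `${base}-${n}.${ext}`;
    }
    used.add(candidate);
    return candidate;
  });
}

const entryNames = uniqueEntryNames([
  { page: 1, name: "img_p0_1" },
  { page: 1, name: "img_p0_1" },
  { page: 2 },
]);
// ["page1-img_p0_1.png", "page1-img_p0_1-1.png", "page2-image2.png"]
```

Including the page number also makes the archive easier to browse, since files sort by page.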
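The process flow: each ImageBitmap is drawn onto a canvas, the canvas is encoded as a PNG Blob, the Blob is added to the ZIP archive, and the finished archive is returned as an ArrayBuffer.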
5. File Download Utility
Finally, we provide a utility to trigger the file download:
export function autoDownloadBlob(blob: Blob, filename: string) {
  // 1. Create temporary Blob URL
  const blobUrl = URL.createObjectURL(blob);
  // 2. Create hidden anchor element
  const downloadLink = document.createElement("a");
  downloadLink.href = blobUrl;
  downloadLink.download = filename;
  downloadLink.style.display = "none";
  // 3. Add to DOM (required by some browsers)
  document.body.appendChild(downloadLink);
  // 4. Trigger download
  downloadLink.click();
  // 5. Cleanup resources
  document.body.removeChild(downloadLink);
  URL.revokeObjectURL(blobUrl);
}
Worker Thread Utilization
Although our main extraction logic runs in the main thread, PDF.js internally uses Web Workers for parsing. This is crucial for performance:
// Configuration in usepdfjs.ts
typedPdfjs.GlobalWorkerOptions.workerSrc = "/pdf/pdf.worker.min.mjs";
The worker handles:
- PDF structure parsing
- Stream decompression
- Font decoding
- Image decompression
This keeps the main thread responsive while heavy parsing operations happen in the background.
File Selection and UI Components
Our implementation includes a sophisticated file selection system that supports both modern and legacy browsers:
// Modern File System Access API
const pickFiles = async (multiple: boolean): Promise<File[] | null> => {
  try {
    const f = await window?.showOpenFilePicker({
      multiple: multiple,
      types: [
        {
          description: "Image File",
          accept: {
            "image/jpeg": [".jpg", ".jpeg"],
            "image/png": [".png"],
            "application/pdf": [".pdf"],
          },
        },
      ],
      mode: "read",
    });
    // ...
  } catch {
    return null;
  }
};

// Fallback to traditional file input
async function traditionalPicker(): Promise<File[] | null> {
  const input = initFileInput();
  const p = new Promise<File[] | null>((resolve, _reject) => {
    input.onchange = (e) => {
      const target = e.target as HTMLInputElement;
      const files = target.files ? Array.from(target.files) : [];
      resolve(files);
      target.value = "";
    };
  });
  input.click();
  return p;
}
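Choosing between the two pickers comes down to feature detection. Note that window?.showOpenFilePicker only guards against window itself being undefined, so a browser without the API ends up in the catch block instead. A minimal sketch of an explicit check follows; choosePicker and PickerKind are hypothetical names, and the scope parameter exists only so the function can be exercised outside a browser.

```typescript
type PickerKind = "file-system-access" | "input-element";

// Returns which picker to use, based on whether the given global scope
// exposes a callable showOpenFilePicker.
function choosePicker(scope: unknown): PickerKind {
  const candidate = (scope as { showOpenFilePicker?: unknown } | null | undefined)
    ?.showOpenFilePicker;
  return typeof candidate === "function" ? "file-system-access" : "input-element";
}

// Simulated environments (assumption): plain objects standing in for window.
const withApi = choosePicker({ showOpenFilePicker: async () => [] });
const withoutApi = choosePicker({});
// In a browser this would be called as choosePicker(window).
```

With a check like this, the catch block in pickFiles only has to deal with real failures such as the user dismissing the dialog.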
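Complete Data Flow
The user selects a PDF, the component passes it to extractImages, extractImagesFromPdf collects the images, zipImageBitmaps packages them as PNG files in a ZIP archive, and autoDownloadBlob saves the archive to disk.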
Performance Considerations
Memory Management
- ImageBitmap.close(): Although commented out in our code, calling bitmap.close() after processing can free GPU memory
- Blob URL Cleanup: We revoke object URLs after use to prevent memory leaks
- Streaming: For very large PDFs, consider implementing chunked processing
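As a rough sketch of what chunked processing could look like, the helper below splits a document's pages into fixed-size batches. Each batch could then be extracted and added to the ZIP, and its bitmaps closed, before the next batch starts. The name pageBatches and the batch size are assumptions for illustration.

```typescript
// Splits pages 1..numPages into consecutive batches of at most batchSize pages.
function pageBatches(numPages: number, batchSize: number): number[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: number[][] = [];
  for (let start = 1; start <= numPages; start += batchSize) {
    const end = Math.min(start + batchSize - 1, numPages);
    const pages: number[] = [];
    for (let p = start; p <= end; p++) {
      pages.push(p);
    }
    batches.push(pages);
  }
  return batches;
}

const batches = pageBatches(5, 2);
// [[1, 2], [3, 4], [5]]
```

Smaller batches lower peak memory use because fewer decoded bitmaps are alive at once, at the cost of more bookkeeping.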
Optimization Strategies
- Lazy Loading: PDF.js scripts are loaded only when needed
- Worker Pool: For batch processing multiple files, implement a worker pool
- Progress Indicators: Show extraction progress for large documents
- Image Preview: Generate thumbnails for extracted images
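For the progress indicator, a small counter that reports after each processed page is usually enough. The sketch below is one possible shape; createProgress and ProgressListener are hypothetical names, and the listener would typically update React state in the real component.

```typescript
type ProgressListener = (done: number, total: number, percent: number) => void;

// Returns a tick function that reports progress each time one unit finishes.
function createProgress(total: number, listener: ProgressListener): () => void {
  let done = 0;
  return () => {
    done = Math.min(done + 1, total); // Never report more than the total
    const percent = total === 0 ? 100 : Math.round((done / total) * 100);
    listener(done, total, percent);
  };
}

// Example: a four-page document with two pages processed so far.
const reported: number[] = [];
const pageDone = createProgress(4, (_done, _total, percent) => {
  reported.push(percent);
});
pageDone();
pageDone();
// reported is [25, 50]
```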
Browser Compatibility
Our implementation works in all modern browsers:
- Chrome/Edge: Full support including File System Access API
- Firefox: Full support with fallback to traditional file input
- Safari: Full support (iOS 13+, macOS 10.15+)
- Mobile: Works on iOS Safari and Android Chrome
Security Considerations
- CSP Compliance: Dynamic script loading requires proper Content Security Policy configuration
- No External Requests: Once loaded, the application doesn't need internet connectivity
- Worker Isolation: PDF.js parses documents in a Web Worker, which has no access to the page's DOM
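To make the CSP point concrete, here is a sketch that assembles a policy string for a deployment where the PDF.js script and worker are served from the same origin, as with the /pdf/ paths in this article. The helper buildCsp is hypothetical and the directive values are assumptions; a real application will likely need additional sources, for example for inline scripts or styles emitted by the framework.

```typescript
// Joins directive names and their sources into a Content-Security-Policy value.
function buildCsp(directives: Record<string, string[]>): string {
  return Object.entries(directives)
    .map(([name, sources]) => `${name} ${sources.join(" ")}`)
    .join("; ");
}

const cspValue = buildCsp({
  "default-src": ["'self'"],
  "script-src": ["'self'"], // Covers the dynamically injected PDF.js module
  "worker-src": ["'self'"], // Covers the PDF.js worker script
});
// "default-src 'self'; script-src 'self'; worker-src 'self'"
```

The resulting string would be sent as the value of the Content-Security-Policy response header.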
Conclusion
We've demonstrated how to build a complete PDF image extraction solution that runs entirely in the browser. By leveraging PDF.js for parsing, HTML5 Canvas for image conversion, and JSZip for packaging, we created a privacy-focused, high-performance tool that requires zero server infrastructure.
The key innovations include:
- Deep PDF.js integration for accessing raw image data
- Transformation matrix handling for accurate coordinate conversion
- Modular architecture separating UI, processing, and utilities
- Progressive enhancement supporting both modern and legacy browsers
Try It Yourself
Ready to extract images from your PDFs? Visit our online tool:
Our tool is completely free, requires no registration, and processes everything locally in your browser. Your files never leave your device, ensuring maximum privacy and security.
Built with ❤️ using Next.js, PDF.js, and modern web technologies.



