DEV Community

hiyoyo

lopdf vs pdfium in Rust — Why I Chose the Smaller One

All tests run on an 8-year-old MacBook Air.

When I started building a PDF tool in Rust, the first decision was which PDF library to use.

The two main options: lopdf and pdfium-render. I chose lopdf. Here's why — and where it hurts.


The options

pdfium-render

  • Bindings to Google's PDFium (the engine inside Chrome)
  • Excellent rendering quality
  • Large binary (~10MB added to app size)
  • Requires bundling the PDFium shared library
  • Great for viewing, not great for manipulation

lopdf

  • Pure Rust PDF manipulation library
  • No external dependencies
  • Small binary footprint
  • Full access to the raw object tree
  • Rendering quality: you're on your own

Why lopdf won

I'm building a tool, not a viewer.

lopdf gives direct access to every object in the PDF — dictionaries, streams, cross-reference tables, the works. For operations like metadata stripping, Bates numbering, stealth watermarking, and structural rebuilding, this low-level access is exactly what you need.

pdfium would abstract all of that away.

// lopdf: direct object manipulation.
// Drop the document-level Info dictionary reference from the trailer,
// then strip Author entries from every dictionary in the file.
doc.trailer.remove(b"Info");

for (_, object) in doc.objects.iter_mut() {
    if let Ok(dict) = object.as_dict_mut() {
        dict.remove(b"Author");
    }
}

You can't do this with pdfium bindings. It doesn't expose the raw object tree.


Where lopdf hurts

Rendering. lopdf can't render pages to images. Zero.

My workaround: use macOS PDFKit (via a Swift sidecar) for all rendering, lopdf for all manipulation. Two engines, clear separation of responsibilities.
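As a minimal sketch of the rendering side of that split: the manipulation code never touches pixels, and rasterization is delegated to an out-of-process sidecar. The binary name `pdf-render-sidecar` and its flags below are invented for illustration, not the actual tool's interface.

```rust
use std::path::Path;
use std::process::Command;

/// Build the invocation that asks a (hypothetical) Swift/PDFKit sidecar
/// to rasterize one page to a PNG. Rendering stays out-of-process;
/// the Rust side only assembles the command.
fn render_page_cmd(pdf: &Path, page: u32, out_png: &Path) -> Command {
    let mut cmd = Command::new("pdf-render-sidecar");
    cmd.arg("--input")
        .arg(pdf)
        .arg("--page")
        .arg(page.to_string())
        .arg("--output")
        .arg(out_png);
    cmd
}

fn main() {
    let cmd = render_page_cmd(Path::new("in.pdf"), 1, Path::new("page1.png"));
    // Spawning would fail unless the sidecar binary actually exists on
    // PATH; here we only print the assembled invocation.
    let args: Vec<String> = cmd
        .get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect();
    println!("{} {}", cmd.get_program().to_string_lossy(), args.join(" "));
}
```

The real value of the split is that either engine can be swapped independently: the manipulation side never depends on what the renderer does with the bytes it is handed.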

Complex PDF features. Heavily encrypted PDFs, some form types, certain font encodings — lopdf struggles. For a general-purpose viewer this would be a dealbreaker. For a tool focused on manipulation, it's acceptable.
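One cheap way to fail fast on the encrypted case before doing any real work: scan the raw bytes for an /Encrypt name. This is a crude heuristic of my own (not lopdf's API, and a false positive is possible if the byte sequence appears inside a stream), but it is enough to route a file to a "not supported" path early.

```rust
/// Crude pre-check: does the raw file mention an /Encrypt entry?
/// Proper detection would parse the trailer dictionary; this just scans
/// the bytes for the name, which suffices as an early bail-out.
fn looks_encrypted(raw: &[u8]) -> bool {
    const NEEDLE: &[u8] = b"/Encrypt";
    raw.windows(NEEDLE.len()).any(|w| w == NEEDLE)
}

fn main() {
    let plain = b"%PDF-1.7\ntrailer << /Root 1 0 R >>\n%%EOF" as &[u8];
    let locked = b"%PDF-1.7\ntrailer << /Encrypt 5 0 R /Root 1 0 R >>\n%%EOF" as &[u8];
    assert!(!looks_encrypted(plain));
    assert!(looks_encrypted(locked));
    println!("ok");
}
```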


The verdict

If you're building a PDF viewer: pdfium-render.
If you're building a PDF tool that manipulates document structure: lopdf.

They're solving different problems.


Hiyoko PDF Vault → https://hiyokoko.gumroad.com/l/HiyokoPDFVault
X → @hiyoyok

Top comments

mote

Great breakdown. I ran into the same dilemma last year building a document pipeline in Rust.

One thing I'd add: lopdf's error handling around malformed PDFs can be... rough. The crate silently skips objects it can't parse, which means you can strip metadata from a file and not realize half the xref table was garbage. I ended up writing a validation pass before any mutation.

The dual-engine approach is smart though. Have you considered using the pdf-writer crate for write-side operations? It's stricter about output conformance than lopdf's save, and catches things like duplicate object IDs at build time.

Also curious — how are you handling cross-reference streams vs classic xref tables? Some of the newer PDFs from Adobe's ecosystem only use streams and lopdf's reconstruction can produce files that Acrobat flags as 'repaired.'
