Mehuli Mukherjee
Why Should Python have all the fun? Meet my new Java Library.

I built ExtractPDF4J — a pure-Java library to pull clean tables from messy PDFs (even scanned ones)

ExtractPDF4J is a Java library that finds and extracts tabular data from both text-based and scanned PDFs using stream (layout/text) and lattice (line/vision) parsing—plus OCR. It’s on Maven Central, designed for server-side use (Spring Boot, microservices), and aims for Camelot-style features in the JVM ecosystem.

Maven Central: io.github.mehulimukherjee:extractpdf4j-parser:0.1.0

Repo: ExtractPDF4J

What Java had before

Apache PDFBox (and earlier iText):
These are general PDF libraries. They let you parse page text, metadata, and sometimes low-level drawing instructions. But they don’t provide high-level table extraction APIs. You’d have to implement your own heuristics for columns, bounding boxes, etc.

Tabula Java:
The Tabula desktop tool (originally Java + JRuby) could extract tables, but it wasn’t built as a clean, embeddable Java library. Most devs ended up using Tabula’s command-line wrapper or calling the Python-based Camelot instead.

Commercial SDKs (e.g., ABBYY, Aspose):
Paid libraries offered OCR + table recognition, but they’re proprietary and heavy. For open source Java projects, that’s not ideal.

Why I built it

Most robust PDF table extractors are Python-first. In fintech/banking backends running on the JVM, pulling Python into prod creates friction (containers, ops, warm-up). I wanted a native Java option with strong accuracy on real-world documents like bank statements, invoices, and reports—without shelling out.

The gap ExtractPDF4J fills

A Camelot-like API in pure Java, with both stream (text-based layout) and lattice (grid/vision) parsing.

What it does (today)

Two parsing modes:

  1. Stream: infers columns from text layout—great for digital PDFs.

  2. Lattice: uses OpenCV-like line and joint detection + OCR—great for scanned PDFs.

  • OCR support (Tesseract): assigns text to detected grid cells for image-based PDFs.

  • Multi-page & multi-table: parse page ranges; detect multiple tables per page.

  • Merged cell handling: rowSpan / colSpan support in lattice mode (see the sketch after this list).

  • Export helpers: get tables as CSV/JSON (and access cell-wise metadata).

  • Production-friendly: built for Java 17+ services (e.g., Spring Boot, AWS/EKS, microservices where Python dependencies aren’t welcome).
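
Here's the promised merged-cell sketch: a minimal example of reading spans back out of a parsed table. Note that cells(), rowSpan(), colSpan(), and text() are accessor names I'm assuming for illustration, not the confirmed API; check the repo's Javadoc for the real Table/Cell surface.

import com.extractpdf4j.*;

public class SpanSketch {
  public static void main(String[] args) throws Exception {
    // Parse one scanned page in lattice mode (builder calls as in the quick start below).
    TableResult result = LatticeParser.builder()
        .pages("1")
        .build()
        .parse(PdfHandler.from("samples/scanned_statement.pdf"));

    // ASSUMPTION: cells(), rowSpan(), colSpan(), text() are illustrative names,
    // not the confirmed API; verify against the repo's Javadoc.
    result.tables().forEach(t ->
        t.cells().forEach(cell -> {
          if (cell.rowSpan() > 1 || cell.colSpan() > 1) {
            System.out.printf("merged %dx%d: %s%n",
                cell.rowSpan(), cell.colSpan(), cell.text());
          }
        }));
  }
}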

Quick start

1) Add the dependency
Maven

<dependency>
  <groupId>io.github.mehulimukherjee</groupId>
  <artifactId>extractpdf4j-parser</artifactId>
  <version>0.1.0</version>
</dependency>

Gradle (Kotlin)

implementation("io.github.mehulimukherjee:extractpdf4j-parser:0.1.0")

  • Optional (for OCR features): install Tesseract OCR and ensure it’s on your PATH.
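
If OCR silently produces empty cells, a missing binary is the usual culprit. A quick preflight check, using only the JDK and nothing ExtractPDF4J-specific, can save debugging time:

public class OcrPreflight {
  public static void main(String[] args) {
    try {
      // Run "tesseract --version"; a zero exit code means the binary is on PATH.
      Process p = new ProcessBuilder("tesseract", "--version")
          .redirectErrorStream(true)
          .start();
      int exit = p.waitFor();
      System.out.println(exit == 0
          ? "Tesseract found on PATH"
          : "Unexpected exit code: " + exit);
    } catch (Exception e) {
      System.out.println("Tesseract not on PATH: " + e.getMessage());
    }
  }
}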

Minimal examples

A) Text-based PDF (Stream mode)

import com.extractpdf4j.*;

public class StreamExample {
  public static void main(String[] args) throws Exception {
    PdfHandler pdf = PdfHandler.from("samples/statement.pdf");
    StreamParser parser = StreamParser.builder()
        .pages("1-3")               // or "all"
        .detectHeaders(true)        // heuristics for Date/Description/Amount etc.
        .build();

    TableResult result = parser.parse(pdf);
    result.tables().forEach(t -> {
      System.out.println("---- TABLE ----");
      System.out.println(t.toCsv());     // or t.toJson()
    });
  }
}

B) Scanned PDF (Lattice + OCR)

import com.extractpdf4j.*;

public class LatticeExample {
  public static void main(String[] args) throws Exception {
    PdfHandler pdf = PdfHandler.from("samples/scanned_statement.pdf");
    LatticeParser parser = LatticeParser.builder()
        .pages("1-2")
        .enableOcr(true)            // assigns OCR text into grid cells
        .exportDebug(false)         // set true to dump grid/joints as images
        .build();

    TableResult result = parser.parse(pdf);
    result.tables().forEach(t -> {
      // Access structured cells, spans, and metadata
      System.out.println(t.toJson());
    });
  }
}

C) Not sure which mode? Auto-detect

AutoParser auto = AutoParser.builder()
    .pages("all")
    .enableOcr(true)
    .build();

TableResult result = auto.parse(PdfHandler.from("samples/mixed.pdf"));

The rule of thumb:

  • Use Stream for digital PDFs (selectable text).

  • Use Lattice for scanned/image PDFs or documents with strong ruling lines.

  • Use Auto if you want the library to decide per page/table.
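
If you'd rather route documents yourself instead of relying on AutoParser, one cheap heuristic is to test whether the first page yields any extractable text at all. A sketch using Apache PDFBox 2.x (a separate dependency; the 50-character threshold is my own arbitrary cutoff, so tune it on your documents):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ModePicker {
  // Heuristic: a scanned page yields almost no extractable text.
  static boolean looksScanned(String path) throws Exception {
    try (PDDocument doc = PDDocument.load(new File(path))) {
      PDFTextStripper stripper = new PDFTextStripper();
      stripper.setStartPage(1);
      stripper.setEndPage(1); // sample the first page only
      return stripper.getText(doc).trim().length() < 50;
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(looksScanned("samples/mixed.pdf")
        ? "use Lattice + OCR" : "use Stream");
  }
}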


Real-world: bank statement extraction

  • Detects transaction tables across pages

  • Heuristically identifies headers like Date, Description, Amount

  • Handles multi-line descriptions and merged cells

  • Exports clean CSV/JSON ready for downstream reconciliation or analytics
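
Putting the pieces together, a batch step for statements might look like this. It uses only the calls shown in the quick start plus the JDK; the out/ directory and file-naming scheme are my own convention, not part of the library:

import com.extractpdf4j.*;
import java.nio.file.*;

public class StatementToCsv {
  public static void main(String[] args) throws Exception {
    TableResult result = StreamParser.builder()
        .pages("all")
        .detectHeaders(true)
        .build()
        .parse(PdfHandler.from("samples/statement.pdf"));

    // Write each detected table to its own CSV for downstream reconciliation.
    Files.createDirectories(Path.of("out"));
    int i = 0;
    for (var t : result.tables()) {
      Files.writeString(Path.of("out/table-" + i++ + ".csv"), t.toCsv());
    }
  }
}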


Roadmap

✅ Multi-page, multi-table, spans, OCR assignment

🚧 Hybrid parser: combine Stream output with Lattice boundaries

🚧 Better automatic header detection across global bank formats

🚧 Optional ML table detection backends (e.g., PubTabNet/Donut-style)

🚧 CLI & Docker image for batch pipelines

(If you’d like any of these faster, open an issue or PR!)


Performance notes

Designed to run in containerized Java services alongside your other microservices.

For OCR workloads, consider caching, concurrency limits, and page selection to control runtime.
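
For example, a fixed-size executor keeps Tesseract from saturating a small container. A sketch using standard JDK executors around the parse calls shown earlier; the pool size of 2 and the file list are illustrative:

import com.extractpdf4j.*;
import java.util.List;
import java.util.concurrent.*;

public class BoundedOcr {
  public static void main(String[] args) throws Exception {
    // Cap concurrent OCR jobs; Tesseract is CPU-heavy, so size this to your container.
    ExecutorService pool = Executors.newFixedThreadPool(2);
    List<String> files = List.of("a.pdf", "b.pdf", "c.pdf");
    try {
      List<Future<TableResult>> futures = files.stream()
          .map(f -> pool.submit(() -> LatticeParser.builder()
              .enableOcr(true)
              .build()
              .parse(PdfHandler.from(f))))
          .toList();
      for (Future<TableResult> fu : futures) {
        // Blocks until that PDF is parsed, then prints its tables as CSV.
        fu.get().tables().forEach(t -> System.out.println(t.toCsv()));
      }
    } finally {
      pool.shutdown();
    }
  }
}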


Contributing

Try it on a tricky PDF (bank statements, invoices) and share a redacted sample.

File issues for false positives/negatives, header detection edge cases, and speedups.

Feedback and PRs are welcome for new detectors, exporters, and language packs. Help me make this a stronger, sharper tool for the JVM.


License & credits

Open source (see LICENSE in the repo).

Inspired by the ideas behind Camelot/Tabula; implemented natively for the Java ecosystem.


Call to action

⭐ Star the repo and try 0.1.0 from Maven Central.

Comment with PDFs you want to handle better—I’ll prioritize those use cases.

P.S. If you write in Python but deploy on JVM, this might save you a few containers and some ops headaches.

Top comments (2)

OnlineProxy

ExtractPDF4J does a decent job wrangling messy tables thanks to some smart heuristics and a mix of lattice and stream parsing. But once those column layouts start shifting from page to page it kinda freaks out. You can get around that with some custom rules, though; a bit of DIY magic. It plays nicest when tucked into a Spring Boot microservice, running async batch jobs. Stacked up against Camelot and Tabula, ExtractPDF4J shines if you're living in JVM land. OCR-wise, it holds its own with noisy scans, but you'll wanna hit it with some pre-processing first. Oh, and while it doesn’t officially rock hybrid parsing, you can mash up the Stream and Lattice outputs yourself for better results.

Mehuli Mukherjee

Thanks for the thoughtful take—totally agree that shifting columns are the toughest case. We now ship a hybrid mode (stream+lattice+OCR) and I’m adding header-anchored alignment + simple YAML rules.
If you’re keen, I’d love a PR or sample PDFs to harden it.
Please check the repo: github.com/mehulimukherjee/Extract...