Derek

Posted on Jun 10

2026 PDF Table Extraction Tools Review: 15 Tools Benchmarked

#pdf #table #pdftoexcel #extractpdf

PDF was designed for "what you see is what you get" document exchange, not structured data — this creates fundamental barriers for table extraction:

PDF lacks a table semantic model: Tables in PDFs are merely collections of lines and characters, with no abstract definition of rows, columns, or cells
Complex table structures pose typical challenges: Merged cells, rotated text, cross-page tables, borderless tables — each of these scenarios causes most tools to produce severely degraded output
Scanned vs. text-based PDFs are fundamentally different: The former relies on OCR engines to convert images to text, while the latter can directly parse the character encoding layer — each demands very different tool capabilities

This article presents a horizontal benchmark of mainstream commercial SDKs/APIs, open-source tools, and free online tools for PDF table extraction, producing quantifiable comparison results based on unified test samples.

Test PDF Description

Test File: Scanned PDF (SEO: The Art of SEO 3rd Edition, English, Page 114)

Output note: Screenshot of original scanned test file (tools that cannot process scanned files were tested on other text-based PDF tables instead)

Property	Value
Type	Scanned (image-only, no text layer)
Page Size	504 x 661.5 pt
Embedded Image	1008 x 1323 px, RGB, 8bit
Chars/Lines/Rects	0 / 0 / 0
Actual Content	1 table with 4 columns and multiple rows

This sample represents a high-difficulty test scenario: a scanned PDF with no text layer, semi-bordered table structure, and hierarchical headers. It requires tools to handle both OCR recognition and table structure reconstruction — most pure-parsing Python libraries cannot process it directly.

Online PDF Table Extraction Tools

Quick Selection Guide

Use Case	Recommended Tool	Alternative	Rationale
Daily simple table to Excel	SmallPDF	iLovePDF	Shortest workflow, upload and get results
Privacy-sensitive documents	PDF24	—	Completely free, strong privacy protection
Developer API integration	ComPDF etc.	PDFTables	Provides REST API for programmable calls
Complex tables (merged cells/hierarchical headers)	ComPDF or other commercial SDKs	—	Online tools generally perform poorly on complex tables; professional SDKs with structure reconstruction are recommended

1. ExtractTable

Item	Details
URL	extracttable.com
Type	Cloud API + Web Demo
Scanned Support	Yes (OCR)
Pricing	Credit-based, from 50 credits/$3

Web demo supports images only (JPG/PNG); paid version supports PDF. Output format: CSV/Excel.

Test Results: Due to demo limitations (images only, 2/day), full scanned PDF testing was not possible. Tested with image input instead — basic continuous text extraction was usable, but = was misrecognized as —, and bold styling, cell sizing, unordered lists, and other formatting were all lost.

According to Mark Kramer's benchmark, ExtractTable has issues with merged cell content misalignment and missing data.

2. SmallPDF

Item	Details
URL	smallpdf.com
Type	Online web tool
Scanned Support	Yes (OCR, Pro version, 7-day free trial)
Free Limit	2/day free, Pro $12/month

Offers PDF to Excel conversion with easy workflow. However, recognition capability is limited with complex structures like merged cells and rotated text.

Test Results: Performed well among online tools — table structure, merged cells, vertical text, and superscript/subscript were all effectively recognized.

3. iLovePDF

Item	Details
URL	ilovepdf.com
Type	Online web tool
Scanned Support	Yes (OCR, paid version)
Free Limit	2/hour free, Pro $6/month

Offers PDF to Excel, PDF to Word, and multi-format conversion. Free version does not include OCR.

Test Results: OCR recognition accuracy was insufficient — some text areas were converted successfully, but significant content remained embedded as raw image slices within tables, failing to achieve true structured output.

4. PDF24 Tools

Item	Details
URL	tools.pdf24.org
Type	Online + Desktop client
Scanned Support	Limited
Pricing	Completely free

German-developed free PDF toolset offering PDF to Excel conversion with no file size limits and strong privacy protection. Limited support for complex table structures.

Test Results: Scanned documents were not OCR-processed — images were embedded directly into Excel output. Table structure recognition failed, with text loss and incorrect cell merge logic.

5. PDFTables.com (Online)

Item	Details
URL	pdftables.com
Type	Online web + API
Scanned Support	No OCR
Pricing	Credit-based, $50/1000 pages

Supports drag-and-drop upload and conversion. Good conversion quality for standard bordered tables, but does not support scanned documents and has no free trial.

Commercial PDF Table Extraction Tools

Quick Selection Guide

Use Case	Recommended Tool	Rationale
Complex merged cells/hierarchical headers/style preservation	ComPDF	Only commercial SDK verified across hierarchical headers, merged cells, and style preservation
Cloud-native/high throughput/AWS ecosystem	AWS Textract	Deep AWS integration, pay-per-use, suitable for elastic throughput
Cross-platform SDK integration (Web/Mobile)	ComPDF	Native cross-platform SDK, suitable for embedding
Desktop occasional use	Adobe Acrobat	Most widely used PDF desktop tool
Enterprise full-stack document processing	ComPDF	Full-platform SDK, enterprise-grade private deployment

1. ComPDF (Recommended)

Item	Details
Product	ComPDF SDK / API
SDK Languages	Python, Java, Go, iOS, Android, C#
Pricing	Contact sales

Core Capabilities (Source: ComPDF Official)

ComPDF is one of the few commercial table extraction SDKs covering all three of the following capabilities:

Capability Dimension	Support Status
Table Type Coverage	Bordered, irregular border, borderless tables
Complex Merged Cells	Cross-row/column merged cell structure reconstruction
Content Preservation	Simultaneous text and image extraction from cells
Style Preservation	Font/typeface/size/color/bold-italic full preservation

Third-Party Benchmark Context: Mark Kramer from MITRE tested 12 mainstream tools in a horizontal benchmark, with the following conclusion (Source: Medium):

"Among all the commercial solutions, ComPDF was the only tool to correctly capture the hierarchical column headers."

Our Test Results:

Evaluation Item	Conclusion
Hierarchical Merged Headers	Correctly captured, best performance among all commercial tools tested
Row/Column Merging	Cross-row/column merge logic fully reconstructed
Text Style	Font, size, bold/italic largely preserved
Table Borders	Correctly identified and reconstructed border positions and line styles
Row/Column Dimensions	Column widths and row heights consistent with original PDF
Known Limitations	Footnote attribution, superscript text, and rotated text recognition still need improvement
SDK Coverage	6 languages (Python / Java / Go / iOS / Android / C#)

2. AWS Textract

Item	Details
Product	Amazon Textract API
Free Tier	100 pages/month for new users (3 months) Pricing
Pricing	$0.015/page (Table mode)

import boto3
client = boto3.client('textract')
response = client.analyze_document(
    Document={'S3Object': {'Bucket': 'my-bucket', 'Name': 'invoice.pdf'}},
    FeatureTypes=['TABLES', 'FORMS']
)

Evaluation:

Basic Table Recognition: Good performance on text-based PDF bordered tables, accurate structure reconstruction
Scanned OCR: Leveraging AWS's underlying OCR engine, scanned document processing is at the top tier of commercial APIs
Merged Cells: Limited support; hierarchical header scenarios show deviations from individual cell outputs
Ecosystem Integration: Native connectivity with AWS services (S3/Lambda/SageMaker), suitable for teams with existing AWS infrastructure

3. Nanonets

Item	Details
Product	Nanonets API
Pricing	$0.10-0.30/run Pricing

Third-Party Reference: Mark Kramer's benchmark shows lower omission rates than ExtractTable, but footnote content is output as garbled text, and merged cells cannot correctly express hierarchical relationships.

Our Test: Continuous text recognition is basically usable; basic text styles (font/typeface/size) are preserved; text color, table border line styles, and other formatting are not reconstructed.

4. Nutrient (formerly PSPDFKit)

Item	Details
Product	Nutrient SDK / API
URL	nutrient.io
Positioning	Enterprise PDF SDK (cross-platform)
Platforms	Web, iOS, Android, Windows, macOS
Pricing	Contact sales (enterprise)

Nutrient (formerly PSPDFKit) is a well-known cross-platform PDF SDK vendor, with core capabilities focused on PDF rendering, annotation, and editing. Table extraction is provided as an API module.

Product Positioning:

Cross-platform native SDK with outstanding PDF rendering and interaction performance
Table extraction is not its core scenario; custom development and integration are required
Suitable for teams with existing Nutrient deployments needing supplementary table capabilities

Test Results: Text content recognition accuracy was insufficient, with issues including positional drift, spacing distortion, and partial text loss. Hierarchical relationships between headers and content were not correctly reconstructed.

5. Adobe Acrobat

Item	Details
Product	Adobe Acrobat Pro DC
Pricing	Subscription ~$19.99/month
Scanned Support	Built-in OCR (Pro version)
Use Case	Desktop occasional, small-scale processing

Product Features:

Strengths: Low barrier to entry, no coding required; Pro version includes OCR for direct scanned document export
Limitations: No batch automation API; output quality is unstable with merged cells

Test Results: Overall table structure was reconstructed, but text loss and individual character recognition errors were present. Table border line styles and cell formatting were not preserved.

6. iText (iText 8 Core / iText 7 Community)

Item	Details
Product	iText 8 Core / iText 7 Community
URL	itextpdf.com
License	AGPL (free open-source) / Commercial license
Platforms	Java, .NET (C#)
Pricing	AGPL free / Commercial contact sales

iText is one of the oldest PDF processing libraries (founded in 1998). iText does not provide a dedicated table extraction API — you must use LocationTextExtractionStrategy to parse text positions and infer table structure.

Test Results Across Three PDFs:

Test File	Type	Extraction Result	Table Structure
SEO Book Page 114	Scanned	0 chars	—
Transcript (1).pdf	Text-based transcript	2218+866 chars	Continuous text stream
Prot_000 8.pdf	Text-based clinical protocol	1062 chars	Continuous text stream

Capability Summary:

Scenario	Result
Text-based PDF text extraction	4/5 — mature text layer extraction
Automatic table recognition	Requires custom development (coordinate-based inference)
Scanned OCR	No built-in OCR
PDF creation/editing	5/5 — industry benchmark

Key Takeaway: iText is mature in the domain of low-level PDF operations, but it is not an out-of-the-box table extraction tool. Text content extraction (cumulative 4,146 characters) is complete and reliable, but table structure (column alignment, merged cells, rotated text) is entirely lost. For structured table output, use dedicated table reconstruction tools like ComPDF (commercial), Camelot, or Docling (open-source).

Open-Source PDF Table Extraction Tools

Quick Selection Guide

Use Case	Recommended Tool	Rationale
Scanned/image PDF table extraction	Docling	9.39s实测完成扫描件表格提取，AI pipeline integration ready
Standard bordered text-based PDF tables	Camelot	Simple lattice mode setup, extraction in a few lines of code
Complex tables (merged cells/rotated text)	pdfplumber + custom code	Requires fine-tuning and custom post-processing
Java ecosystem/existing Java projects	tabula-py / iText	Naturally compatible with Java tech stack
Full document understanding (AI pipeline)	Docling	Native integration with LangChain/LlamaIndex
Zero budget	Any open-source tool	All MIT licensed, no licensing fees

1. Docling (IBM)

Item	Details
Version	v2.99.0 (June 8, 2026) GitHub
License	MIT
GitHub Stars	61,200+ — fastest growing PDF open-source project
Dependencies	Python 3.10+, models require initial download (~2-5GB)

Core Features (Source: GitHub README + Technical Report)

Multi-format support (PDF/DOCX/PPTX/XLSX/HTML/images/audio/email etc.)
Built-in TableFormer (claims 93.6% accuracy vs Tabula 67.9%, Camelot 73.0%)
Built-in OCR (via RapidOCR for scanned documents)
Integrations: LangChain / LlamaIndex / Crew AI / Haystack

Test Results (June 2026)

Environment: Models downloaded successfully via HF_ENDPOINT=https://hf-mirror.com

Conversion time: 9.39s
Markdown length: 2573 chars
Heron layout model: loaded (770/770 weights)
OCR engine: RapidOCR
Table detection: successful

Conclusions:

Table Structure Recognition: 4-column multi-row table structure fully preserved, row-column correspondence correct
Scanned Document Processing: Successfully extracted a structured table from a pure image PDF, validating AI model feasibility for scanned documents
OCR English Accuracy: Low due to RapidOCR's optimization for Chinese characters; English recognition drift (e.g., "Google" misrecognized as "Googfe") is the current version's main bottleneck

Improvement Direction: Pairing with an English-specialized OCR engine (e.g., Tesseract English model) would significantly improve English table recognition accuracy on scanned documents.

2. pdfplumber

Item	Details
Version	v0.11.9 (January 2026) PyPI
License	MIT, based on pdfminer.six
Positioning	Character-level low-level PDF parsing engine

Test Results (June 2026):

Chars: 0, Lines: 0, Rects: 0, Tables: 0

Cannot process scanned files — 0 characters, 0 lines, 0 tables. Consistent with official documentation: "Works best on machine-generated, rather than scanned, PDFs".

Text-based PDF Tests (New):

Transcript (1).pdf (Student transcript, 2 pages):

Chars: 2905 | Tables: 0 | Time: 0.13s

pdfplumber successfully extracted 2905 characters of text but detected zero tables (0 tables). This is because pdfplumber's table detection depends on graphical lines (lines/rects), while this transcript uses borderless tables ("ghost tables") where the visual table structure is formed purely by text alignment.

Prot_000 8.pdf (Clinical protocol schedule, 1 page):

Chars: 1017 | Tables: 1 | Time: 1.19s

Successfully detected 1 table (Schedule of Events), benefiting from the table's complete border lines. However, merged cell information (e.g., cross-column time axis headers) was lost, and rotated text could not be recognized.

Conclusion: pdfplumber detects bordered tables accurately (consistent with Mark Kramer's findings), but is completely ineffective on borderless tables. As a low-level library, significant custom code is required for complex scenarios.

3. Camelot

Item	Details
Version	v2.0.0 (June 4, 2026) PyPI
License	MIT
Dependencies	Python 3.10+, PyTorch optional for ML mode

5 Parsers: lattice (bordered) / stream (borderless) / network (text alignment) / ml (Table Transformer) / auto (automatic)

Test Results:

Lattice Mode: Cannot process scanned files; parser explicitly rejects image-based page input

Text-based PDF Tests (New):

Transcript (1).pdf (transcript, borderless table):

Mode	Result
lattice	❌ 0 tables — no table lines detected
stream	✅ 1 table, 95.8% accuracy — 0.07s, successfully identified borderless table

Prot_000 8.pdf (clinical protocol, bordered table):

Mode	Result
lattice	✅ 1 table, 97.97% accuracy — 0.36s, complete structure
stream	✅ 2 tables, 100% accuracy — 0.08s, fastest speed

Conclusion: The extracted Excel files showed disorganized structure and text-to-cell mapping misalignment.

4. tabula-py

Item	Details
Version	v2.10.0 (October 2024) PyPI
License	MIT
Dependency	Java 8+ + Python 3.9+

Test Results (Java 21 + tabula-py 2.10.0):

Tables found: 0

Cannot process scanned files. Even with Java installed, tabula-py cannot parse image-based PDFs without a text layer. Tabula's official documentation states: "Tabula only works on text-based PDFs, not scanned documents."

Text-based PDF Tests (New):

Transcript (1).pdf (Student transcript, 2 pages):

Tables found: 2 | Time: 1.54s
Table 1: 29 rows x 7 cols (transfer credits)
Table 2: 8 rows x 7 cols (GE 2022 semester)

tabula-py successfully identified the two semester course tables as separate DataFrames with complete column structure (Course Code / Description / Credits / Grade / Quality Points). All 29 transfer credit course records were correctly captured.

Prot_000 8.pdf (Clinical protocol schedule, 1 page):

ERROR: 'utf-8' codec can't decode byte 0xa1

tabula-py threw an encoding exception when processing this PDF. The PDF contains special characters (e.g., registered trademark symbol ®, ±), and tabula-py's Java subprocess output decoding failed.

Summary: tabula-py's text recognition for standard tables (e.g., transcripts) is basically correct, but the structure reconstruction is disorganized, text-to-row/column mapping is misaligned, and table borders were not correctly identified.

Conclusion

This article presents a horizontal benchmark of 15 tools using three test files of varying difficulty (scanned no-text-layer PDF / text-based transcript PDF / text-based clinical protocol PDF). The key findings are:

1. Scanned Documents Remain the Biggest Differentiator

Of the 15 tools tested, only Docling, ComPDF, AWS Textract, Nanonets, Nutrient, and Adobe Acrobat can process scanned documents
Pure-parsing tools like pdfplumber, Camelot(lattice), tabula-py, and iText completely fail on scanned documents (0 chars/0 tables)
OCR capability is the first differentiator for PDF table extraction tools

2. Table Structure Reconstruction on Text-Based PDFs Is the Real Test

Even among tools that successfully extract text from text-based PDFs, very few can fully preserve table structure (row-column correspondence, merged cells, border styles)
Docling produced 27 Markdown tables (18,048 chars) from the text transcript and 21 tables (12,922 chars) from the clinical protocol — its AI-driven layout analysis significantly outperforms other open-source tools
Camelot achieves the highest accuracy on bordered tables (lattice 97.97%), and its stream mode works on borderless tables (95.8%) — the best choice among pure-parsing tools
tabula-py correctly identifies columns on standard transcript tables but has encoding compatibility issues
iText extracts text completely (4,146 characters cumulative), but table structure is entirely lost — only continuous text stream output

3. The Gap Between Commercial and Open-Source Tools Is Largest for Complex Tables

For merged cells/hierarchical headers, ComPDF is the only commercially available SDK that passed verification
Open-source tools are adequate for simple bordered tables, but the gap widens significantly for complex table scenarios

DEV Community

2026 PDF Table Extraction Tools Review: 15 Tools Benchmarked

Test PDF Description

Online PDF Table Extraction Tools

Quick Selection Guide

1. ExtractTable

2. SmallPDF

3. iLovePDF

4. PDF24 Tools

5. PDFTables.com (Online)

Commercial PDF Table Extraction Tools

Quick Selection Guide

1. ComPDF (Recommended)

2. AWS Textract

3. Nanonets

4. Nutrient (formerly PSPDFKit)

5. Adobe Acrobat

6. iText (iText 8 Core / iText 7 Community)

Open-Source PDF Table Extraction Tools

Quick Selection Guide

1. Docling (IBM)

2. pdfplumber

3. Camelot

4. tabula-py

Conclusion

Top comments (0)