DEV Community

Derek
2026 PDF Table Extraction Tools Review: 15 Tools Benchmarked

PDF was designed for "what you see is what you get" document exchange, not structured data — this creates fundamental barriers for table extraction:

  • PDF lacks a table semantic model: Tables in PDFs are merely collections of lines and characters, with no abstract definition of rows, columns, or cells
  • Complex table structures pose typical challenges: Merged cells, rotated text, cross-page tables, borderless tables — each of these scenarios causes most tools to produce severely degraded output
  • Scanned vs. text-based PDFs are fundamentally different: The former relies on OCR engines to convert images to text, while the latter can directly parse the character encoding layer — each demands very different tool capabilities

This article presents a side-by-side benchmark of mainstream commercial SDKs/APIs, open-source tools, and free online tools for PDF table extraction, producing quantifiable comparison results from a shared set of test samples.


Test PDF Description

Test File: Scanned PDF (The Art of SEO, 3rd Edition, English, page 114)

Figure: Screenshot of the original scanned test file. Tools that cannot process scanned files were tested on other text-based PDF tables instead.

| Property | Value |
| --- | --- |
| Type | Scanned (image-only, no text layer) |
| Page Size | 504 x 661.5 pt |
| Embedded Image | 1008 x 1323 px, RGB, 8-bit |
| Chars/Lines/Rects | 0 / 0 / 0 |
| Actual Content | 1 table with 4 columns and multiple rows |

This sample represents a high-difficulty test scenario: a scanned PDF with no text layer, semi-bordered table structure, and hierarchical headers. It requires tools to handle both OCR recognition and table structure reconstruction — most pure-parsing Python libraries cannot process it directly.
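The figures in the property table also give the scan resolution. The sketch below is a minimal illustration, assuming the embedded image covers the full page; the function names are hypothetical and not taken from any tool in this review. It also shows the kind of check that flags a page as image-only.

```python
def effective_dpi(image_px: int, page_pt: float) -> float:
    """Pixels per inch when the image spans the full page (1 pt = 1/72 inch)."""
    return image_px / (page_pt / 72.0)


def looks_scanned(n_chars: int, n_lines: int, n_rects: int, n_images: int) -> bool:
    """Heuristic: no text or vector objects, but at least one embedded image."""
    return n_chars == 0 and n_lines == 0 and n_rects == 0 and n_images > 0


# Values from the test file description above.
dpi_x = effective_dpi(1008, 504)     # horizontal resolution
dpi_y = effective_dpi(1323, 661.5)   # vertical resolution
print(dpi_x, dpi_y)                  # 144.0 144.0
print(looks_scanned(0, 0, 0, 1))     # True
```

At 144 DPI the sample sits well below the roughly 300 DPI that OCR engines commonly recommend, which may partly explain the character-level errors reported later in this review.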


Online PDF Table Extraction Tools

Quick Selection Guide

| Use Case | Recommended Tool | Alternative | Rationale |
| --- | --- | --- | --- |
| Daily simple table to Excel | SmallPDF | iLovePDF | Shortest workflow, upload and get results |
| Privacy-sensitive documents | PDF24 | | Completely free, strong privacy protection |
| Developer API integration | ComPDF etc. | PDFTables | Provides REST API for programmable calls |
| Complex tables (merged cells/hierarchical headers) | ComPDF or other commercial SDKs | | Online tools generally perform poorly on complex tables; professional SDKs with structure reconstruction are recommended |

1. ExtractTable

| Item | Details |
| --- | --- |
| URL | extracttable.com |
| Type | Cloud API + Web Demo |
| Scanned Support | Yes (OCR) |
| Pricing | Credit-based, from 50 credits/$3 |

Web demo supports images only (JPG/PNG); paid version supports PDF. Output format: CSV/Excel.

Test Results: Due to demo limitations (images only, 2/day), full scanned PDF testing was not possible. Tested with image input instead — basic continuous text extraction was usable, but = was misrecognized as , and bold styling, cell sizing, unordered lists, and other formatting were all lost.

According to Mark Kramer's benchmark, ExtractTable has issues with merged cell content misalignment and missing data.


2. SmallPDF

| Item | Details |
| --- | --- |
| URL | smallpdf.com |
| Type | Online web tool |
| Scanned Support | Yes (OCR, Pro version, 7-day free trial) |
| Free Limit | 2/day free, Pro $12/month |

Offers PDF-to-Excel conversion with a simple workflow. Recognition can be limited on complex structures such as merged cells and rotated text, although it handled them well on our sample.

Test Results: Performed well among the online tools tested. On our sample, table structure, merged cells, vertical text, and superscript/subscript were all recognized effectively.


3. iLovePDF

| Item | Details |
| --- | --- |
| URL | ilovepdf.com |
| Type | Online web tool |
| Scanned Support | Yes (OCR, paid version) |
| Free Limit | 2/hour free, Pro $6/month |

Offers PDF to Excel, PDF to Word, and multi-format conversion. Free version does not include OCR.

Test Results: OCR recognition accuracy was insufficient — some text areas were converted successfully, but significant content remained embedded as raw image slices within tables, failing to achieve true structured output.


4. PDF24 Tools

| Item | Details |
| --- | --- |
| URL | tools.pdf24.org |
| Type | Online + Desktop client |
| Scanned Support | Limited |
| Pricing | Completely free |

German-developed free PDF toolset offering PDF to Excel conversion with no file size limits and strong privacy protection. Limited support for complex table structures.

Test Results: Scanned documents were not OCR-processed — images were embedded directly into Excel output. Table structure recognition failed, with text loss and incorrect cell merge logic.


5. PDFTables.com (Online)

| Item | Details |
| --- | --- |
| URL | pdftables.com |
| Type | Online web + API |
| Scanned Support | No OCR |
| Pricing | Credit-based, $50/1000 pages |

Supports drag-and-drop upload and conversion. Good conversion quality for standard bordered tables, but does not support scanned documents and has no free trial.


Commercial PDF Table Extraction Tools

Quick Selection Guide

| Use Case | Recommended Tool | Rationale |
| --- | --- | --- |
| Complex merged cells/hierarchical headers/style preservation | ComPDF | Only commercial SDK verified across hierarchical headers, merged cells, and style preservation |
| Cloud-native/high throughput/AWS ecosystem | AWS Textract | Deep AWS integration, pay-per-use, suitable for elastic throughput |
| Cross-platform SDK integration (Web/Mobile) | ComPDF | Native cross-platform SDK, suitable for embedding |
| Desktop occasional use | Adobe Acrobat | Most widely used PDF desktop tool |
| Enterprise full-stack document processing | ComPDF | Full-platform SDK, enterprise-grade private deployment |

1. ComPDF (Recommended)

| Item | Details |
| --- | --- |
| Product | ComPDF SDK / API |
| SDK Languages/Platforms | Python, Java, Go, iOS, Android, C# |
| Pricing | Contact sales |

Core Capabilities (Source: ComPDF Official)

ComPDF is one of the few commercial table extraction SDKs covering all of the following capabilities:

| Capability Dimension | Support Status |
| --- | --- |
| Table Type Coverage | Bordered, irregular border, borderless tables |
| Complex Merged Cells | Cross-row/column merged cell structure reconstruction |
| Content Preservation | Simultaneous text and image extraction from cells |
| Style Preservation | Font/typeface/size/color/bold-italic full preservation |

Third-Party Benchmark Context: Mark Kramer from MITRE compared 12 mainstream tools side by side and reached the following conclusion (Source: Medium):

"Among all the commercial solutions, ComPDF was the only tool to correctly capture the hierarchical column headers."

Our Test Results:

| Evaluation Item | Conclusion |
| --- | --- |
| Hierarchical Merged Headers | Correctly captured, best performance among all commercial tools tested |
| Row/Column Merging | Cross-row/column merge logic fully reconstructed |
| Text Style | Font, size, bold/italic largely preserved |
| Table Borders | Correctly identified and reconstructed border positions and line styles |
| Row/Column Dimensions | Column widths and row heights consistent with original PDF |
| Known Limitations | Footnote attribution, superscript text, and rotated text recognition still need improvement |
| SDK Coverage | 6 languages/platforms (Python / Java / Go / iOS / Android / C#) |

2. AWS Textract

| Item | Details |
| --- | --- |
| Product | Amazon Textract API |
| Free Tier | 100 pages/month for new users, first 3 months (source: Pricing) |
| Pricing | $0.015/page (Table mode) |

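import boto3
# Assumes AWS credentials and a default region are already configured.
client = boto3.client('textract')
# Synchronous analysis of a single-page document stored in S3;
# multi-page PDFs need the asynchronous start_document_analysis API.
response = client.analyze_document(
    Document={'S3Object': {'Bucket': 'my-bucket', 'Name': 'invoice.pdf'}},
    FeatureTypes=['TABLES', 'FORMS']  # request table and key-value (form) blocks
)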

Evaluation:

  • Basic Table Recognition: Good performance on text-based PDF bordered tables, accurate structure reconstruction
  • Scanned OCR: Leveraging AWS's underlying OCR engine, scanned document processing is at the top tier of commercial APIs
  • Merged Cells: Limited support; with hierarchical headers, the per-cell output deviates from the original structure
  • Ecosystem Integration: Native connectivity with AWS services (S3/Lambda/SageMaker), suitable for teams with existing AWS infrastructure
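Textract returns a flat list of blocks rather than a ready-made grid, so the table has to be reassembled from TABLE, CELL, and WORD blocks. The following is a minimal sketch of that step, assuming the documented block fields (`BlockType`, `RowIndex`, `ColumnIndex`, `RowSpan`, `ColumnSpan`, `Relationships`). The helper name and the sample data are hypothetical, and merged cells are simply placed in their top-left slot.

```python
def tables_from_blocks(blocks):
    """Rebuild each TABLE block as a list of rows from Textract's flat block list."""
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        return [i for rel in block.get("Relationships", [])
                if rel["Type"] == "CHILD" for i in rel["Ids"]]

    def cell_text(cell):
        children = [by_id[i] for i in child_ids(cell)]
        return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")

    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = [by_id[i] for i in child_ids(table)
                 if by_id[i]["BlockType"] == "CELL"]
        if not cells:
            continue
        n_rows = max(c["RowIndex"] + c.get("RowSpan", 1) - 1 for c in cells)
        n_cols = max(c["ColumnIndex"] + c.get("ColumnSpan", 1) - 1 for c in cells)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for c in cells:
            # RowIndex and ColumnIndex are 1-based.
            grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c)
        tables.append(grid)
    return tables


# Hand-written sample in the shape of response["Blocks"] (illustrative data only).
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
    {"Id": "w1", "BlockType": "WORD", "Text": "Factor"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Weight"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Page"},
    {"Id": "w4", "BlockType": "WORD", "Text": "speed"},
]
print(tables_from_blocks(sample_blocks))
```

With a real response, the call would be `tables_from_blocks(response["Blocks"])`. Hierarchical headers need extra handling on top of this, since a spanning header cell occupies only one slot in the grid.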

3. Nanonets

| Item | Details |
| --- | --- |
| Product | Nanonets API |
| Pricing | $0.10-0.30/run (source: Pricing) |

Third-Party Reference: Mark Kramer's benchmark shows lower omission rates than ExtractTable, but footnote content is output as garbled text, and merged cells cannot correctly express hierarchical relationships.

Our Test: Continuous text recognition is basically usable; basic text styles (font/typeface/size) are preserved; text color, table border line styles, and other formatting are not reconstructed.


4. Nutrient (formerly PSPDFKit)

| Item | Details |
| --- | --- |
| Product | Nutrient SDK / API |
| URL | nutrient.io |
| Positioning | Enterprise PDF SDK (cross-platform) |
| Platforms | Web, iOS, Android, Windows, macOS |
| Pricing | Contact sales (enterprise) |

Nutrient (formerly PSPDFKit) is a well-known cross-platform PDF SDK vendor, with core capabilities focused on PDF rendering, annotation, and editing. Table extraction is provided as an API module.

Product Positioning:

  • Cross-platform native SDK with outstanding PDF rendering and interaction performance
  • Table extraction is not its core scenario; custom development and integration are required
  • Suitable for teams with existing Nutrient deployments needing supplementary table capabilities

Test Results: Text content recognition accuracy was insufficient, with issues including positional drift, spacing distortion, and partial text loss. Hierarchical relationships between headers and content were not correctly reconstructed.


5. Adobe Acrobat

| Item | Details |
| --- | --- |
| Product | Adobe Acrobat Pro DC |
| Pricing | Subscription ~$19.99/month |
| Scanned Support | Built-in OCR (Pro version) |
| Use Case | Desktop occasional, small-scale processing |

Product Features:

  • Strengths: Low barrier to entry, no coding required; Pro version includes OCR for direct scanned document export
  • Limitations: No batch automation API; output quality is unstable with merged cells

Test Results: Overall table structure was reconstructed, but text loss and individual character recognition errors were present. Table border line styles and cell formatting were not preserved.


6. iText (iText 8 Core / iText 7 Community)

| Item | Details |
| --- | --- |
| Product | iText 8 Core / iText 7 Community |
| URL | itextpdf.com |
| License | AGPL (free open-source) / Commercial license |
| Platforms | Java, .NET (C#) |
| Pricing | AGPL free / Commercial contact sales |

iText is one of the oldest PDF processing libraries (founded in 1998). iText does not provide a dedicated table extraction API — you must use LocationTextExtractionStrategy to parse text positions and infer table structure.
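To make "infer table structure from text positions" concrete, here is a minimal Python sketch of the idea. It is not iText's API: the `(x, y, text)` tuples stand in for whatever positions a text extraction strategy reports, and the function name, tolerance, and sample values are all hypothetical. Rows are grouped by baseline proximity, and the header row's x positions serve as column anchors.

```python
def infer_table(words, y_tol=3.0):
    """Turn (x, y, text) fragments into rows of cells using position alone."""
    # 1. Group fragments into rows; PDF y grows upward, so sort top to bottom.
    rows = []
    for x, y, text in sorted(words, key=lambda w: (-w[1], w[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tol:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    if not rows:
        return []

    # 2. Treat the x positions of the first (header) row as column anchors.
    anchors = [x for x, _ in sorted(rows[0]["cells"])]

    # 3. Snap every fragment to its nearest anchor.
    table = []
    for row in rows:
        cells = [""] * len(anchors)
        for x, text in sorted(row["cells"]):
            col = min(range(len(anchors)), key=lambda i: abs(anchors[i] - x))
            cells[col] = (cells[col] + " " + text).strip()
        table.append(cells)
    return table


sample = [
    (50, 700, "Course"), (200, 700, "Credits"), (300, 699.5, "Grade"),
    (50, 680, "MATH101"), (200, 680.4, "3"), (300, 680, "A"),
    (50, 660, "HIST200"), (300, 660, "B"),
]
for row in infer_table(sample):
    print(row)
```

Even this toy version shows why the approach is fragile: merged cells, wrapped lines, and rotated text all break the assumption that each row shares a baseline and each column shares a left edge.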

Test Results Across Three PDFs:

| Test File | Type | Extraction Result | Table Structure |
| --- | --- | --- | --- |
| SEO Book Page 114 | Scanned | 0 chars | |
| Transcript (1).pdf | Text-based transcript | 2218+866 chars | Continuous text stream |
| Prot_000 8.pdf | Text-based clinical protocol | 1062 chars | Continuous text stream |

Capability Summary:

| Scenario | Result |
| --- | --- |
| Text-based PDF text extraction | 4/5 (mature text layer extraction) |
| Automatic table recognition | Requires custom development (coordinate-based inference) |
| Scanned OCR | No built-in OCR |
| PDF creation/editing | 5/5 (industry benchmark) |

Key Takeaway: iText is mature in the domain of low-level PDF operations, but it is not an out-of-the-box table extraction tool. Text content extraction (cumulative 4,146 characters) is complete and reliable, but table structure (column alignment, merged cells, rotated text) is entirely lost. For structured table output, use dedicated table reconstruction tools like ComPDF (commercial), Camelot, or Docling (open-source).


Open-Source PDF Table Extraction Tools

Quick Selection Guide

| Use Case | Recommended Tool | Rationale |
| --- | --- | --- |
| Scanned/image PDF table extraction | Docling | Completed scanned-table extraction in 9.39s in our test; ready for AI pipeline integration |
| Standard bordered text-based PDF tables | Camelot | Simple lattice mode setup, extraction in a few lines of code |
| Complex tables (merged cells/rotated text) | pdfplumber + custom code | Requires fine-tuning and custom post-processing |
| Java ecosystem/existing Java projects | tabula-py / iText | Naturally compatible with Java tech stack |
| Full document understanding (AI pipeline) | Docling | Native integration with LangChain/LlamaIndex |
| Zero budget | Any open-source tool | The four tools in this section are MIT licensed with no licensing fees (iText's free edition is AGPL) |

1. Docling (IBM)

| Item | Details |
| --- | --- |
| Version | v2.99.0 (June 8, 2026; source: GitHub) |
| License | MIT |
| GitHub Stars | 61,200+ (fastest-growing open-source PDF project) |
| Dependencies | Python 3.10+, models require initial download (~2-5GB) |

Core Features (Source: GitHub README + Technical Report)

  • Multi-format support (PDF/DOCX/PPTX/XLSX/HTML/images/audio/email etc.)
  • Built-in TableFormer (claims 93.6% accuracy vs Tabula 67.9%, Camelot 73.0%)
  • Built-in OCR (via RapidOCR for scanned documents)
  • Integrations: LangChain / LlamaIndex / Crew AI / Haystack

Test Results (June 2026)

Environment: Models downloaded successfully via HF_ENDPOINT=https://hf-mirror.com

  • Conversion time: 9.39s
  • Markdown length: 2573 chars
  • Heron layout model: loaded (770/770 weights)
  • OCR engine: RapidOCR
  • Table detection: successful

Conclusions:

  • Table Structure Recognition: 4-column multi-row table structure fully preserved, row-column correspondence correct
  • Scanned Document Processing: Successfully extracted a structured table from a pure image PDF, validating AI model feasibility for scanned documents
  • OCR English Accuracy: Low due to RapidOCR's optimization for Chinese characters; English recognition drift (e.g., "Google" misrecognized as "Googfe") is the current version's main bottleneck
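One way to quantify the kind of drift described above is the character error rate (CER): the edit distance between the expected and recognized text, divided by the length of the expected text. This is a standard metric, but the article does not report it; the sketch below is only an illustration using the "Google" / "Googfe" example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling one-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


print(edit_distance("Google", "Googfe"))              # 1
print(round(char_error_rate("Google", "Googfe"), 3))  # 0.167
```

Computing CER over a whole page requires a hand-typed reference transcription of the table, which is the main cost of measuring OCR quality this way.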

Improvement Direction: Pairing with an English-specialized OCR engine (e.g., Tesseract English model) would significantly improve English table recognition accuracy on scanned documents.
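A hedged sketch of what that pairing might look like follows. It assumes Docling's pipeline options expose a Tesseract backend through `TesseractOcrOptions`, as recent 2.x releases do; the class names, the `lang` field, and the extra dependencies this requires (a Tesseract installation and its Python binding) may differ in the version tested here. This configuration was not part of the benchmark.

```python
def build_converter(lang=("eng",)):
    """Return a Docling converter whose PDF pipeline uses Tesseract for OCR."""
    # Imported inside the function so this file loads without Docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TesseractOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = True               # run OCR on image-only pages
    options.do_table_structure = True   # keep table structure recognition on
    options.ocr_options = TesseractOcrOptions(lang=list(lang))
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )


def pdf_to_markdown(path, converter=None):
    """Convert one PDF and return its Markdown export."""
    converter = converter or build_converter()
    result = converter.convert(path)
    return result.document.export_to_markdown()
```

Calling `pdf_to_markdown("scanned.pdf")` (a placeholder file name) would then give output that can be compared directly against the RapidOCR run above.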


2. pdfplumber

| Item | Details |
| --- | --- |
| Version | v0.11.9 (January 2026; source: PyPI) |
| License | MIT, based on pdfminer.six |
| Positioning | Character-level low-level PDF parsing engine |

Test Results (June 2026):

Chars: 0, Lines: 0, Rects: 0, Tables: 0

Cannot process scanned files — 0 characters, 0 lines, 0 tables. Consistent with official documentation: "Works best on machine-generated, rather than scanned, PDFs".

Text-based PDF Tests (New):

Transcript (1).pdf (Student transcript, 2 pages):

Chars: 2905 | Tables: 0 | Time: 0.13s

pdfplumber extracted 2905 characters of text but detected 0 tables. By default, pdfplumber's table detection relies on graphical lines (lines/rects), while this transcript uses borderless tables ("ghost tables") whose visual structure is formed purely by text alignment.
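pdfplumber's table settings also accept a `"text"` strategy, which derives cell boundaries from word alignment instead of drawn lines. The sketch below shows how that could be tried on a borderless table. It was not part of the benchmark, the helper name is hypothetical, and the result on this transcript is unknown.

```python
# Table settings that infer column and row edges from aligned words.
TEXT_ALIGNED_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "min_words_vertical": 3,  # words that must line up to imply a column edge
}


def extract_borderless_tables(path, page_number=0, settings=None):
    """Return the tables found on one page using text-alignment detection."""
    import pdfplumber  # imported lazily so this file loads without pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        return page.extract_tables(table_settings=settings or TEXT_ALIGNED_SETTINGS)
```

The text strategy tends to need per-document tuning, which fits the article's point that pdfplumber is a low-level library rather than a turnkey extractor.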

Prot_000 8.pdf (Clinical protocol schedule, 1 page):

Chars: 1017 | Tables: 1 | Time: 1.19s

Successfully detected 1 table (Schedule of Events), benefiting from the table's complete border lines. However, merged cell information (e.g., cross-column time axis headers) was lost, and rotated text could not be recognized.

Conclusion: pdfplumber detects bordered tables accurately (consistent with Mark Kramer's findings), but with its default line-based settings it found nothing in the borderless table. As a low-level library, it requires significant custom code for complex scenarios.


3. Camelot

| Item | Details |
| --- | --- |
| Version | v2.0.0 (June 4, 2026; source: PyPI) |
| License | MIT |
| Dependencies | Python 3.10+, PyTorch optional for ML mode |

5 Parsers: lattice (bordered) / stream (borderless) / network (text alignment) / ml (Table Transformer) / auto (automatic)

Test Results:

  • Lattice Mode: Cannot process scanned files; parser explicitly rejects image-based page input

Text-based PDF Tests (New):

Transcript (1).pdf (transcript, borderless table):

| Mode | Result |
| --- | --- |
| lattice | ❌ 0 tables (no table lines detected) |
| stream | 1 table, 95.8% accuracy, 0.07s; borderless table identified |

Prot_000 8.pdf (clinical protocol, bordered table):

| Mode | Result |
| --- | --- |
| lattice | 1 table, 97.97% accuracy, 0.36s; complete structure |
| stream | 2 tables, 100% accuracy, 0.08s; fastest |

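Conclusion: Despite the high accuracy scores, which are Camelot's own parsing metric rather than a comparison against ground truth, the extracted Excel files showed disorganized structure and misaligned text-to-cell mapping.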
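For reference, a comparison like the one above can be scripted with `camelot.read_pdf` and each table's `parsing_report`. This is a minimal sketch with hypothetical helper names, limited to the long-standing `lattice` and `stream` flavors. The accuracy value comes from Camelot itself, so a manual check of the exported cells is still needed.

```python
def run_flavors(path, pages="1", flavors=("lattice", "stream")):
    """Run Camelot once per flavor and collect each table's parsing report."""
    import camelot  # imported lazily so this file loads without Camelot

    results = {}
    for flavor in flavors:
        tables = camelot.read_pdf(path, pages=pages, flavor=flavor)
        results[flavor] = [t.parsing_report for t in tables]
    return results


def summarize(results):
    """Map each flavor to (table_count, mean self-reported accuracy or None)."""
    summary = {}
    for flavor, reports in results.items():
        accuracies = [r["accuracy"] for r in reports]
        mean = sum(accuracies) / len(accuracies) if accuracies else None
        summary[flavor] = (len(reports), mean)
    return summary


# Reports shaped like Camelot's parsing_report, filled with the figures above.
example = {
    "lattice": [{"accuracy": 97.97, "page": 1}],
    "stream": [{"accuracy": 100.0, "page": 1}, {"accuracy": 100.0, "page": 1}],
}
print(summarize(example))
```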


4. tabula-py

| Item | Details |
| --- | --- |
| Version | v2.10.0 (October 2024; source: PyPI) |
| License | MIT |
| Dependency | Java 8+ and Python 3.9+ |

Test Results (Java 21 + tabula-py 2.10.0):

Tables found: 0

Cannot process scanned files. Even with Java installed, tabula-py cannot parse image-based PDFs without a text layer. Tabula's official documentation states: "Tabula only works on text-based PDFs, not scanned documents."

Text-based PDF Tests (New):

Transcript (1).pdf (Student transcript, 2 pages):

Tables found: 2 | Time: 1.54s
Table 1: 29 rows x 7 cols (transfer credits)
Table 2: 8 rows x 7 cols (GE 2022 semester)

tabula-py successfully identified the two semester course tables as separate DataFrames with complete column structure (Course Code / Description / Credits / Grade / Quality Points). All 29 transfer credit course records were correctly captured.

Prot_000 8.pdf (Clinical protocol schedule, 1 page):

ERROR: 'utf-8' codec can't decode byte 0xa1

tabula-py threw an encoding exception when processing this PDF. The PDF contains special characters (e.g., registered trademark symbol ®, ±), and tabula-py's Java subprocess output decoding failed.
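The error itself is easy to reproduce: bytes such as 0xa1 or 0xb1 are valid single-byte characters in Windows-1252 but not valid UTF-8 on their own. The sketch below shows a decode-with-fallback helper plus a possible workaround through tabula-py's `encoding` parameter. The helper names are hypothetical, and whether `cp1252` resolves the failure on this particular file was not tested.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252")):
    """Try each encoding in order; fall back to lossy UTF-8 if all fail."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


def read_tables(path, encoding="cp1252"):
    """Ask tabula-py to decode the Java subprocess output with another codec."""
    import tabula  # imported lazily; requires tabula-py and a Java runtime

    return tabula.read_pdf(path, pages="all", encoding=encoding)


print(decode_with_fallback(b"dose \xb1 5"))    # ('dose ± 5', 'cp1252')
print(decode_with_fallback("±".encode("utf-8")))  # ('±', 'utf-8')
```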

Summary: tabula-py's text recognition for standard tables (e.g., transcripts) is basically correct, but the structure reconstruction is disorganized, text-to-row/column mapping is misaligned, and table borders were not correctly identified.


Conclusion

This article presents a side-by-side benchmark of 15 tools using three test files of varying difficulty (a scanned PDF with no text layer, a text-based transcript PDF, and a text-based clinical protocol PDF). The key findings are:

1. Scanned Documents Remain the Biggest Differentiator

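  • Among the commercial and open-source tools tested, only Docling, ComPDF, AWS Textract, Nanonets, Nutrient, and Adobe Acrobat could process the scanned sample; among the online tools, SmallPDF handled it well, while iLovePDF and PDF24 left content embedded as images
  • Pure-parsing tools like pdfplumber, Camelot (lattice), tabula-py, and iText fail completely on scanned documents (0 chars/0 tables)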
  • OCR capability is the first differentiator for PDF table extraction tools

2. Table Structure Reconstruction on Text-Based PDFs Is the Real Test

  • Even among tools that successfully extract text from text-based PDFs, very few can fully preserve table structure (row-column correspondence, merged cells, border styles)
  • Docling produced 27 Markdown tables (18,048 chars) from the text transcript and 21 tables (12,922 chars) from the clinical protocol — its AI-driven layout analysis significantly outperforms other open-source tools
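  • Camelot reports the highest accuracy on bordered tables (lattice 97.97%), and its stream mode works on borderless tables (95.8%). These scores are Camelot's own parsing metric and the exported files still showed misaligned cells, but it remains the best choice among pure-parsing tools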
  • tabula-py correctly identifies columns on standard transcript tables but has encoding compatibility issues
  • iText extracts text completely (4,146 characters cumulative), but table structure is entirely lost — only continuous text stream output

3. The Gap Between Commercial and Open-Source Tools Is Largest for Complex Tables

  • For merged cells/hierarchical headers, ComPDF is the only commercially available SDK that passed verification
  • Open-source tools are adequate for simple bordered tables, but the gap widens significantly for complex table scenarios