PDF was designed for "what you see is what you get" document exchange, not structured data — this creates fundamental barriers for table extraction:
- PDF lacks a table semantic model: Tables in PDFs are merely collections of lines and characters, with no abstract definition of rows, columns, or cells
- Complex table structures are the typical failure cases: merged cells, rotated text, cross-page tables, and borderless tables each cause most tools to produce severely degraded output
- Scanned vs. text-based PDFs are fundamentally different: The former relies on OCR engines to convert images to text, while the latter can directly parse the character encoding layer — each demands very different tool capabilities
This article presents a horizontal benchmark of mainstream commercial SDKs/APIs, open-source tools, and free online tools for PDF table extraction, producing quantifiable comparison results based on unified test samples.
Test PDF Description
Test File: Scanned PDF (SEO: The Art of SEO 3rd Edition, English, Page 114)
Note: Tools that cannot process scanned files were tested on other text-based PDF tables instead.
| Property | Value |
|---|---|
| Type | Scanned (image-only, no text layer) |
| Page Size | 504 x 661.5 pt |
| Embedded Image | 1008 x 1323 px, RGB, 8bit |
| Chars/Lines/Rects | 0 / 0 / 0 |
| Actual Content | 1 table with 4 columns and multiple rows |
This sample represents a high-difficulty test scenario: a scanned PDF with no text layer, semi-bordered table structure, and hierarchical headers. It requires tools to handle both OCR recognition and table structure reconstruction — most pure-parsing Python libraries cannot process it directly.
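
The properties in the table above can be checked with a short inspection script. The following is a minimal sketch, assuming pdfplumber is installed; `classify_page` and its category names are hypothetical helpers introduced here for illustration, not part of any library.

```python
def classify_page(n_chars: int, n_lines: int, n_rects: int, n_images: int) -> str:
    """Roughly classify a page from its object counts (illustrative heuristic)."""
    if n_chars == 0 and n_images > 0:
        return "scanned"      # image only, no text layer: OCR is required
    if n_chars > 0 and (n_lines + n_rects) > 0:
        return "text+ruling"  # text layer plus drawn lines: bordered tables possible
    if n_chars > 0:
        return "text-only"    # text layer without drawn lines: borderless layout
    return "empty"


def inspect_first_page(path: str) -> dict:
    """Collect the counts shown in the table for the first page of a PDF."""
    import pdfplumber  # third-party; imported lazily so classify_page works without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[0]
        info = {
            "size_pt": (page.width, page.height),
            "chars": len(page.chars),
            "lines": len(page.lines),
            "rects": len(page.rects),
            "images": len(page.images),
        }
    info["kind"] = classify_page(
        info["chars"], info["lines"], info["rects"], info["images"]
    )
    return info
```

For a page like the scanned sample (0 chars, 0 lines, 0 rects, one embedded image), `classify_page` returns `"scanned"`, which signals that an OCR-capable tool is needed before any table parsing can start.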
Online PDF Table Extraction Tools
Quick Selection Guide
| Use Case | Recommended Tool | Alternative | Rationale |
|---|---|---|---|
| Daily simple table to Excel | SmallPDF | iLovePDF | Shortest workflow, upload and get results |
| Privacy-sensitive documents | PDF24 | — | Completely free, strong privacy protection |
| Developer API integration | ComPDF etc. | PDFTables | Provides REST API for programmable calls |
| Complex tables (merged cells/hierarchical headers) | ComPDF or other commercial SDKs | — | Online tools generally perform poorly on complex tables; professional SDKs with structure reconstruction are recommended |
1. ExtractTable
| Item | Details |
|---|---|
| URL | extracttable.com |
| Type | Cloud API + Web Demo |
| Scanned Support | Yes (OCR) |
| Pricing | Credit-based, from 50 credits/$3 |
Web demo supports images only (JPG/PNG); paid version supports PDF. Output format: CSV/Excel.
Test Results: Due to demo limitations (images only, 2/day), full scanned PDF testing was not possible. Tested with image input instead — basic continuous text extraction was usable, but = was misrecognized as —, and bold styling, cell sizing, unordered lists, and other formatting were all lost.
According to Mark Kramer's benchmark, ExtractTable has issues with merged cell content misalignment and missing data.
2. SmallPDF
| Item | Details |
|---|---|
| URL | smallpdf.com |
| Type | Online web tool |
| Scanned Support | Yes (OCR, Pro version, 7-day free trial) |
| Free Limit | 2/day free, Pro $12/month |
Offers PDF to Excel conversion with an easy workflow. Recognition can be limited on complex structures such as merged cells and rotated text, although it handled these well on our sample.
Test Results: Performed well among online tools — table structure, merged cells, vertical text, and superscript/subscript were all effectively recognized.
3. iLovePDF
| Item | Details |
|---|---|
| URL | ilovepdf.com |
| Type | Online web tool |
| Scanned Support | Yes (OCR, paid version) |
| Free Limit | 2/hour free, Pro $6/month |
Offers PDF to Excel, PDF to Word, and multi-format conversion. Free version does not include OCR.
Test Results: OCR recognition accuracy was insufficient — some text areas were converted successfully, but significant content remained embedded as raw image slices within tables, failing to achieve true structured output.
4. PDF24 Tools
| Item | Details |
|---|---|
| URL | tools.pdf24.org |
| Type | Online + Desktop client |
| Scanned Support | Limited |
| Pricing | Completely free |
German-developed free PDF toolset offering PDF to Excel conversion with no file size limits and strong privacy protection. Limited support for complex table structures.
Test Results: Scanned documents were not OCR-processed — images were embedded directly into Excel output. Table structure recognition failed, with text loss and incorrect cell merge logic.
5. PDFTables.com (Online)
| Item | Details |
|---|---|
| URL | pdftables.com |
| Type | Online web + API |
| Scanned Support | No OCR |
| Pricing | Credit-based, $50/1000 pages |
Supports drag-and-drop upload and conversion. Good conversion quality for standard bordered tables, but does not support scanned documents and has no free trial.
Commercial PDF Table Extraction Tools
Quick Selection Guide
| Use Case | Recommended Tool | Rationale |
|---|---|---|
| Complex merged cells/hierarchical headers/style preservation | ComPDF | Only commercial SDK verified across hierarchical headers, merged cells, and style preservation |
| Cloud-native/high throughput/AWS ecosystem | AWS Textract | Deep AWS integration, pay-per-use, suitable for elastic throughput |
| Cross-platform SDK integration (Web/Mobile) | ComPDF | Native cross-platform SDK, suitable for embedding |
| Desktop occasional use | Adobe Acrobat | Most widely used PDF desktop tool |
| Enterprise full-stack document processing | ComPDF | Full-platform SDK, enterprise-grade private deployment |
1. ComPDF (Recommended)
| Item | Details |
|---|---|
| Product | ComPDF SDK / API |
| SDK Languages | Python, Java, Go, iOS, Android, C# |
| Pricing | Contact sales |
Core Capabilities (Source: ComPDF Official)
ComPDF is one of the few commercial table extraction SDKs covering all three of the following capabilities:
| Capability Dimension | Support Status |
|---|---|
| Table Type Coverage | Bordered, irregular border, borderless tables |
| Complex Merged Cells | Cross-row/column merged cell structure reconstruction |
| Content Preservation | Simultaneous text and image extraction from cells |
| Style Preservation | Font/typeface/size/color/bold-italic full preservation |
Third-Party Benchmark Context: Mark Kramer from MITRE tested 12 mainstream tools in a horizontal benchmark, with the following conclusion (Source: Medium):
"Among all the commercial solutions, ComPDF was the only tool to correctly capture the hierarchical column headers."
Our Test Results:
| Evaluation Item | Conclusion |
|---|---|
| Hierarchical Merged Headers | Correctly captured, best performance among all commercial tools tested |
| Row/Column Merging | Cross-row/column merge logic fully reconstructed |
| Text Style | Font, size, bold/italic largely preserved |
| Table Borders | Correctly identified and reconstructed border positions and line styles |
| Row/Column Dimensions | Column widths and row heights consistent with original PDF |
| Known Limitations | Footnote attribution, superscript text, and rotated text recognition still need improvement |
| SDK Coverage | 6 languages (Python / Java / Go / iOS / Android / C#) |
2. AWS Textract
| Item | Details |
|---|---|
| Product | Amazon Textract API |
| Free Tier | 100 pages/month for new users (3 months) |
| Pricing | $0.015/page (Table mode) |
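
```python
import boto3

# Synchronous call for a single-page document stored in S3.
# Credentials and region come from the standard AWS configuration.
client = boto3.client('textract')
response = client.analyze_document(
    Document={'S3Object': {'Bucket': 'my-bucket', 'Name': 'invoice.pdf'}},
    FeatureTypes=['TABLES', 'FORMS'],  # request table and key-value extraction
)
```
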
Evaluation:
- Basic Table Recognition: Good performance on text-based PDF bordered tables, accurate structure reconstruction
- Scanned OCR: Leveraging AWS's underlying OCR engine, scanned document processing is at the top tier of commercial APIs
- Merged Cells: Limited support; in hierarchical header scenarios, individual cell outputs deviate from the original layout
- Ecosystem Integration: Native connectivity with AWS services (S3/Lambda/SageMaker), suitable for teams with existing AWS infrastructure
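
The `analyze_document` response is a flat list of blocks rather than a ready-made grid, so some reassembly is needed. The sketch below shows one way to rebuild rows and columns from `TABLE`, `CELL`, and `WORD` blocks using their `RowIndex`, `ColumnIndex`, and `CHILD` relationships. The function name `tables_from_blocks` is our own, and the sketch deliberately ignores merged-cell information, so it should be read as an illustration rather than a complete parser.

```python
from collections import defaultdict


def tables_from_blocks(blocks: list[dict]) -> list[list[list[str]]]:
    """Rebuild each TABLE block as a list of rows of cell strings."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block: dict) -> list[str]:
        ids: list[str] = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return ids

    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = defaultdict(dict)  # row index -> {column index: cell text}
        for cell_id in child_ids(table):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue  # skip anything that is not a plain cell
            words = [
                by_id[word_id]["Text"]
                for word_id in child_ids(cell)
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            cells[cell["RowIndex"]][cell["ColumnIndex"]] = " ".join(words)
        n_cols = max((max(row) for row in cells.values()), default=0)
        tables.append([
            [cells[r].get(c, "") for c in range(1, n_cols + 1)]  # indices are 1-based
            for r in sorted(cells)
        ])
    return tables
```

With the call shown earlier, usage would be `tables_from_blocks(response["Blocks"])`. Keeping the function independent of boto3 makes it easy to test against saved JSON responses.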
3. Nanonets
| Item | Details |
|---|---|
| Product | Nanonets API |
| Pricing | $0.10-0.30/run |
Third-Party Reference: Mark Kramer's benchmark shows lower omission rates than ExtractTable, but footnote content is output as garbled text, and merged cells cannot correctly express hierarchical relationships.
Our Test: Continuous text recognition is basically usable; basic text styles (font/typeface/size) are preserved; text color, table border line styles, and other formatting are not reconstructed.
4. Nutrient (formerly PSPDFKit)
| Item | Details |
|---|---|
| Product | Nutrient SDK / API |
| URL | nutrient.io |
| Positioning | Enterprise PDF SDK (cross-platform) |
| Platforms | Web, iOS, Android, Windows, macOS |
| Pricing | Contact sales (enterprise) |
Nutrient (formerly PSPDFKit) is a well-known cross-platform PDF SDK vendor, with core capabilities focused on PDF rendering, annotation, and editing. Table extraction is provided as an API module.
Product Positioning:
- Cross-platform native SDK with outstanding PDF rendering and interaction performance
- Table extraction is not its core scenario; custom development and integration are required
- Suitable for teams with existing Nutrient deployments needing supplementary table capabilities
Test Results: Text content recognition accuracy was insufficient, with issues including positional drift, spacing distortion, and partial text loss. Hierarchical relationships between headers and content were not correctly reconstructed.
5. Adobe Acrobat
| Item | Details |
|---|---|
| Product | Adobe Acrobat Pro DC |
| Pricing | Subscription ~$19.99/month |
| Scanned Support | Built-in OCR (Pro version) |
| Use Case | Desktop occasional, small-scale processing |
Product Features:
- Strengths: Low barrier to entry, no coding required; Pro version includes OCR for direct scanned document export
- Limitations: No batch automation API; output quality is unstable with merged cells
Test Results: Overall table structure was reconstructed, but text loss and individual character recognition errors were present. Table border line styles and cell formatting were not preserved.
6. iText (iText 8 Core / iText 7 Community)
| Item | Details |
|---|---|
| Product | iText 8 Core / iText 7 Community |
| URL | itextpdf.com |
| License | AGPL (free open-source) / Commercial license |
| Platforms | Java, .NET (C#) |
| Pricing | AGPL free / Commercial contact sales |
iText is one of the oldest PDF processing libraries (founded in 1998). iText does not provide a dedicated table extraction API — you must use LocationTextExtractionStrategy to parse text positions and infer table structure.
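
To give a sense of what "inferring table structure from text positions" involves, here is a language-neutral sketch of the inference step, written in Python for brevity. It assumes the text chunks and their coordinates have already been extracted (in iText, that would be the job of the extraction strategy). `Chunk`, `cluster`, and `infer_grid` are hypothetical names, the tolerances are arbitrary, and the approach only works for left-aligned columns.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    text: str
    x: float  # left edge of the text chunk, in points
    y: float  # baseline position, in points


def cluster(values: list[float], tol: float) -> list[float]:
    """Group sorted values whose neighbours differ by at most tol; return group means."""
    groups: list[list[float]] = []
    for value in sorted(values):
        if groups and value - groups[-1][-1] <= tol:
            groups[-1].append(value)
        else:
            groups.append([value])
    return [sum(group) / len(group) for group in groups]


def nearest(value: float, centres: list[float]) -> int:
    """Index of the centre closest to value."""
    return min(range(len(centres)), key=lambda i: abs(centres[i] - value))


def infer_grid(chunks: list[Chunk], x_tol: float = 5.0, y_tol: float = 3.0) -> list[list[str]]:
    """Assign each chunk to a (row, column) cell based on its coordinates."""
    cols = cluster([c.x for c in chunks], x_tol)
    rows = cluster([c.y for c in chunks], y_tol)
    rows.reverse()  # PDF y grows upward, so the largest y is the top row
    grid = [["" for _ in cols] for _ in rows]
    for chunk in chunks:
        r, k = nearest(chunk.y, rows), nearest(chunk.x, cols)
        grid[r][k] = (grid[r][k] + " " + chunk.text).strip()
    return grid
```

Even this small sketch has no notion of merged cells, centred columns, or rotated text, which illustrates why coordinate-based inference tends to require substantial custom development.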
Test Results Across Three PDFs:
| Test File | Type | Extraction Result | Table Structure |
|---|---|---|---|
| SEO Book Page 114 | Scanned | 0 chars | — |
| Transcript (1).pdf | Text-based transcript | 2218+866 chars | Continuous text stream |
| Prot_000 8.pdf | Text-based clinical protocol | 1062 chars | Continuous text stream |
Capability Summary:
| Scenario | Result |
|---|---|
| Text-based PDF text extraction | 4/5 — mature text layer extraction |
| Automatic table recognition | Requires custom development (coordinate-based inference) |
| Scanned OCR | No built-in OCR |
| PDF creation/editing | 5/5 — industry benchmark |
Key Takeaway: iText is mature in the domain of low-level PDF operations, but it is not an out-of-the-box table extraction tool. Text content extraction (cumulative 4,146 characters) is complete and reliable, but table structure (column alignment, merged cells, rotated text) is entirely lost. For structured table output, use dedicated table reconstruction tools like ComPDF (commercial), Camelot, or Docling (open-source).
Open-Source PDF Table Extraction Tools
Quick Selection Guide
| Use Case | Recommended Tool | Rationale |
|---|---|---|
| Scanned/image PDF table extraction | Docling | Completed scanned table extraction in 9.39s in our test; AI pipeline integration ready |
| Standard bordered text-based PDF tables | Camelot | Simple lattice mode setup, extraction in a few lines of code |
| Complex tables (merged cells/rotated text) | pdfplumber + custom code | Requires fine-tuning and custom post-processing |
| Java ecosystem/existing Java projects | tabula-py / iText | Naturally compatible with Java tech stack |
| Full document understanding (AI pipeline) | Docling | Native integration with LangChain/LlamaIndex |
| Zero budget | Any open-source tool below | Docling, pdfplumber, Camelot, and tabula-py are all MIT licensed, no licensing fees |
1. Docling (IBM)
| Item | Details |
|---|---|
| Version | v2.99.0 (June 8, 2026), per GitHub |
| License | MIT |
| GitHub Stars | 61,200+ — fastest growing PDF open-source project |
| Dependencies | Python 3.10+, models require initial download (~2-5GB) |
Core Features (Source: GitHub README + Technical Report)
- Multi-format support (PDF/DOCX/PPTX/XLSX/HTML/images/audio/email etc.)
- Built-in TableFormer (claims 93.6% accuracy vs Tabula 67.9%, Camelot 73.0%)
- Built-in OCR (via RapidOCR for scanned documents)
- Integrations: LangChain / LlamaIndex / Crew AI / Haystack
Test Results (June 2026)
Environment: Models downloaded successfully via HF_ENDPOINT=https://hf-mirror.com
- Conversion time: 9.39s
- Markdown length: 2573 chars
- Heron layout model: loaded (770/770 weights)
- OCR engine: RapidOCR
- Table detection: successful
Conclusions:
- Table Structure Recognition: 4-column multi-row table structure fully preserved, row-column correspondence correct
- Scanned Document Processing: Successfully extracted a structured table from a pure image PDF, validating AI model feasibility for scanned documents
- OCR English Accuracy: Low due to RapidOCR's optimization for Chinese characters; English recognition drift (e.g., "Google" misrecognized as "Googfe") is the current version's main bottleneck
Improvement Direction: Pairing with an English-specialized OCR engine (e.g., Tesseract English model) would significantly improve English table recognition accuracy on scanned documents.
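
As a rough illustration of that pairing, the sketch below configures Docling to use Tesseract for OCR and reports the same metrics listed above (conversion time, Markdown length, table count). The Docling class names (`PdfPipelineOptions`, `TesseractCliOcrOptions`, `PdfFormatOption`) reflect our understanding of the Docling 2.x API and should be verified against the installed version. The setup also assumes the `tesseract` binary with English data is available. `count_markdown_tables` is a helper of our own, and we have not measured how much accuracy this configuration actually gains.

```python
import re
import time

# Matches a Markdown table separator row such as |---|---| or | :--- | ---: |
_SEPARATOR = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?\s*$")


def count_markdown_tables(markdown: str) -> int:
    """Count pipe tables by counting their header separator rows."""
    return sum(
        1
        for line in markdown.splitlines()
        if "|" in line and _SEPARATOR.match(line)
    )


def convert_with_tesseract(path: str) -> dict:
    """Convert a PDF with Docling, using Tesseract (English) as the OCR engine."""
    # Third-party imports are local so the helper above stays usable on its own.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = True
    options.do_table_structure = True
    options.ocr_options = TesseractCliOcrOptions(lang=["eng"])

    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
    start = time.perf_counter()
    markdown = converter.convert(path).document.export_to_markdown()
    return {
        "seconds": round(time.perf_counter() - start, 2),
        "markdown_chars": len(markdown),
        "tables": count_markdown_tables(markdown),
    }
```

Running the same file through both OCR engines and diffing the Markdown output would be one simple way to check whether misreadings such as "Googfe" disappear.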
2. pdfplumber
| Item | Details |
|---|---|
| Version | v0.11.9 (January 2026), per PyPI |
| License | MIT, based on pdfminer.six |
| Positioning | Character-level low-level PDF parsing engine |
Test Results (June 2026):
Chars: 0, Lines: 0, Rects: 0, Tables: 0
Cannot process scanned files — 0 characters, 0 lines, 0 tables. Consistent with official documentation: "Works best on machine-generated, rather than scanned, PDFs".
Text-based PDF Tests (New):
Transcript (1).pdf (Student transcript, 2 pages):
Chars: 2905 | Tables: 0 | Time: 0.13s
pdfplumber successfully extracted 2905 characters of text but detected zero tables. By default, pdfplumber's table detection depends on graphical lines (lines/rects), while this transcript uses borderless tables ("ghost tables") where the visual table structure is formed purely by text alignment.
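
pdfplumber also documents a `"text"` strategy that derives cell boundaries from word alignment instead of drawn lines. The sketch below shows how that setting could be applied. We did not run it on this transcript, so treat it as a starting point: text-based detection often needs tuning and can produce spurious rows and columns. `drop_empty_rows` is a small helper of our own.

```python
# Table settings that infer boundaries from text alignment rather than ruling lines.
TEXT_ALIGNED_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def drop_empty_rows(table: list[list]) -> list[list]:
    """Remove rows in which every cell is None or blank."""
    return [row for row in table if any((cell or "").strip() for cell in row)]


def extract_borderless_tables(path: str, settings: dict = TEXT_ALIGNED_SETTINGS) -> list:
    """Extract tables from every page using alignment-based detection."""
    import pdfplumber  # third-party; imported lazily

    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables(table_settings=settings):
                cleaned = drop_empty_rows(table)
                if cleaned:
                    tables.append(cleaned)
    return tables
```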
Prot_000 8.pdf (Clinical protocol schedule, 1 page):
Chars: 1017 | Tables: 1 | Time: 1.19s
Successfully detected 1 table (Schedule of Events), benefiting from the table's complete border lines. However, merged cell information (e.g., cross-column time axis headers) was lost, and rotated text could not be recognized.
Conclusion: pdfplumber detects bordered tables accurately (consistent with Mark Kramer's findings), but is completely ineffective on borderless tables. As a low-level library, significant custom code is required for complex scenarios.
3. Camelot
| Item | Details |
|---|---|
| Version | v2.0.0 (June 4, 2026), per PyPI |
| License | MIT |
| Dependencies | Python 3.10+, PyTorch optional for ML mode |
5 Parsers: lattice (bordered) / stream (borderless) / network (text alignment) / ml (Table Transformer) / auto (automatic)
Test Results:
- Lattice Mode: Cannot process scanned files; parser explicitly rejects image-based page input
Text-based PDF Tests (New):
Transcript (1).pdf (transcript, borderless table):
| Mode | Result |
|---|---|
| lattice | ❌ 0 tables — no table lines detected |
| stream | ✅ 1 table, 95.8% accuracy — 0.07s, successfully identified borderless table |
Prot_000 8.pdf (clinical protocol, bordered table):
| Mode | Result |
|---|---|
| lattice | ✅ 1 table, 97.97% accuracy — 0.36s, complete structure |
| stream | ✅ 2 tables, 100% accuracy — 0.08s, fastest speed |
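Conclusion: Despite the high accuracy scores above, the extracted Excel files showed disorganized structure and text-to-cell mapping misalignment, so the scores alone should not be taken as proof of usable output.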
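
The lattice/stream comparison above can be scripted along the following lines. This is a sketch that uses only the `lattice` and `stream` flavors and the `accuracy` field of each table's `parsing_report`. `compare_flavors` and `pick_best` are our own helper names, and the page selection is an assumption.

```python
from typing import Optional


def pick_best(reports: dict[str, list[float]]) -> Optional[str]:
    """Return the flavor with the highest mean accuracy, skipping flavors with no tables."""
    scored = {flavor: sum(acc) / len(acc) for flavor, acc in reports.items() if acc}
    return max(scored, key=scored.get) if scored else None


def compare_flavors(path: str, flavors=("lattice", "stream"), pages: str = "1") -> dict:
    """Run Camelot once per flavor and collect the reported accuracy of each table."""
    import camelot  # third-party; imported lazily

    reports = {}
    for flavor in flavors:
        tables = camelot.read_pdf(path, pages=pages, flavor=flavor)
        reports[flavor] = [table.parsing_report["accuracy"] for table in tables]
    return reports
```

Given the mismatch between reported accuracy and the exported files, it is worth inspecting each `table.df` directly before trusting whichever flavor `pick_best` selects.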
4. tabula-py
| Item | Details |
|---|---|
| Version | v2.10.0 (October 2024), per PyPI |
| License | MIT |
| Dependency | Java 8+ + Python 3.9+ |
Test Results (Java 21 + tabula-py 2.10.0):
Tables found: 0
Cannot process scanned files. Even with Java installed, tabula-py cannot parse image-based PDFs without a text layer. Tabula's official documentation states: "Tabula only works on text-based PDFs, not scanned documents."
Text-based PDF Tests (New):
Transcript (1).pdf (Student transcript, 2 pages):
Tables found: 2 | Time: 1.54s
Table 1: 29 rows x 7 cols (transfer credits)
Table 2: 8 rows x 7 cols (GE 2022 semester)
tabula-py successfully identified the two semester course tables as separate DataFrames with complete column structure (Course Code / Description / Credits / Grade / Quality Points). All 29 transfer credit course records were correctly captured.
Prot_000 8.pdf (Clinical protocol schedule, 1 page):
ERROR: 'utf-8' codec can't decode byte 0xa1
tabula-py threw an encoding exception when processing this PDF. The PDF contains special characters (e.g., registered trademark symbol ®, ±), and tabula-py's Java subprocess output decoding failed.
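
One workaround worth trying is to retry with a different output encoding, since `tabula.read_pdf` accepts an `encoding` argument. The sketch below wraps that retry in a generic helper. We have not verified that it resolves the failure on this particular file, and the list of fallback encodings is an assumption. `read_with_fallback` and `tabula_reader` are our own names.

```python
def read_with_fallback(reader, path, encodings=("utf-8", "cp1252", "latin-1")):
    """Call reader(path, encoding) with each encoding until one decodes cleanly.

    Returns a (encoding, result) tuple; re-raises the last decode error if all fail.
    """
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for encoding in encodings:
        try:
            return encoding, reader(path, encoding)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


def tabula_reader(path, encoding):
    """Read all tables from a PDF with tabula-py using the given encoding."""
    import tabula  # third-party; requires a Java runtime

    return tabula.read_pdf(path, pages="all", multiple_tables=True, encoding=encoding)
```

Usage would be `encoding, frames = read_with_fallback(tabula_reader, "Prot_000 8.pdf")`. If a fallback encoding succeeds, symbols such as ® and ± should still be checked in the resulting DataFrames, because a wrong but permissive encoding can silently produce incorrect characters.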
Summary: tabula-py's text recognition for standard tables (e.g., transcripts) is basically correct, but the structure reconstruction is disorganized, text-to-row/column mapping is misaligned, and table borders were not correctly identified.
Conclusion
This article presents a horizontal benchmark of 15 tools using three test files of varying difficulty (scanned no-text-layer PDF / text-based transcript PDF / text-based clinical protocol PDF). The key findings are:
1. Scanned Documents Remain the Biggest Differentiator
- Among the SDKs, APIs, and libraries tested, only Docling, ComPDF, AWS Textract, Nanonets, Nutrient, and Adobe Acrobat can process scanned documents; several online tools also offer OCR, mostly in paid tiers
- Pure-parsing tools like pdfplumber, Camelot (lattice), tabula-py, and iText completely fail on scanned documents (0 chars/0 tables)
- OCR capability is the first differentiator for PDF table extraction tools
2. Table Structure Reconstruction on Text-Based PDFs Is the Real Test
- Even among tools that successfully extract text from text-based PDFs, very few can fully preserve table structure (row-column correspondence, merged cells, border styles)
- Docling produced 27 Markdown tables (18,048 chars) from the text transcript and 21 tables (12,922 chars) from the clinical protocol — its AI-driven layout analysis significantly outperforms other open-source tools
- Camelot achieves the highest accuracy on bordered tables (lattice 97.97%), and its stream mode works on borderless tables (95.8%) — the best choice among pure-parsing tools
- tabula-py correctly identifies columns on standard transcript tables but has encoding compatibility issues
- iText extracts text completely (4,146 characters cumulative), but table structure is entirely lost — only continuous text stream output
3. The Gap Between Commercial and Open-Source Tools Is Largest for Complex Tables
- For merged cells/hierarchical headers, ComPDF is the only commercially available SDK that passed verification
- Open-source tools are adequate for simple bordered tables, but the gap widens significantly for complex table scenarios













