Document parsing is the foundation of enterprise AI applications. Whether you're building RAG pipelines, automating insurance claims, or extracting data from financial reports, everything starts with one question: Can you consistently transform messy, real-world documents into structured, machine-readable data?
Our customers need the best document ingestion API for their use cases. They're comparing Azure, AWS Textract, and popular open-source models like Docling and Marker.
We built a benchmark that measures what matters: Can downstream systems actually use this output?
Measuring What Actually Matters
Tensorlake both reads documents and extracts structured data, so our benchmark needed to measure both sides: document parsing with structural preservation, and structured extraction for downstream usability.
The aspects of Document Parsing that we wanted to measure were:
- Tables: Parsing accuracy on complex tables with merged cells and multi-row headers
- Reading Order: Whether reading order is preserved when parsing multi-column documents and documents with complex layouts
- Structured Extraction Accuracy: Direct downstream usability of extracted data. A single OCR error in a table cell can break a downstream task even when overall OCR accuracy on the document is high
- Non-Textual Content: Extraction of footnotes, formulas, figures, and other non-textual content
Our Evaluation Methodology
We employ two metrics that capture these properties and reflect real-world reliability:
TEDS (Tree Edit Distance Similarity)
- Compares predicted and ground-truth Markdown/HTML tree structures
- Captures structural fidelity in tables and complex layouts
- Widely adopted in OCRBench v2 and OmniDocBench evaluations
- Measures whether the document's logical structure and textual alignment remain intact
TEDS answers: "Is this table still a table?" Not just "Is the text similar?"
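To make this concrete, here is a minimal TEDS sketch in Python. It is a simplification of the official implementation used by OCRBench v2 and OmniDocBench, which also weights node substitution costs by cell-text similarity; the `zss`/`lxml` approach and the sample HTML below are our own illustrative choices, not the benchmark code.

```python
# pip install zss lxml
# Simplified TEDS: 1 - TreeEditDistance / max(|T_pred|, |T_gt|).
from lxml import html as lxml_html
from zss import Node, simple_distance


def html_to_tree(fragment: str) -> Node:
    """Convert an HTML table fragment into a zss tree.

    Node labels combine the tag with normalised cell text, so both
    structure and content contribute to the edit distance.
    """
    def build(element) -> Node:
        node = Node(f"{element.tag}|{(element.text or '').strip()}")
        for child in element:
            node.addkid(build(child))
        return node

    return build(lxml_html.fromstring(fragment))


def tree_size(node: Node) -> int:
    return 1 + sum(tree_size(c) for c in Node.get_children(node))


def teds(pred_html: str, gt_html: str) -> float:
    pred, gt = html_to_tree(pred_html), html_to_tree(gt_html)
    distance = simple_distance(pred, gt)  # Zhang-Shasha tree edit distance
    return 1.0 - distance / max(tree_size(pred), tree_size(gt))


gt = "<table><tr><td>Revenue</td><td>120</td></tr></table>"
pred = "<table><tr><td>Revenue</td><td>12O</td></tr></table>"  # one mis-read digit
print(f"TEDS: {teds(pred, gt):.2f}")  # below 1.0: the mis-read cell lowers the score
```

A dropped row, a merged cell split apart, or a column shifted out of place all show up as extra tree edits, which is exactly the kind of damage a plain text-similarity score misses.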
JSON F1 (Field-Level Precision and Recall)
- Compares extracted JSON against schema-based ground truth
- Precision measures correctness of extracted fields
- Recall measures completeness of required field capture
- F1 score balances both for overall reliability assessment
JSON F1 answers: "Can downstream automation actually use this data?" Not just "Is some text present?"
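As a sketch of how the field-level scoring works, the snippet below compares a predicted record against a ground-truth record, assuming flat key/value fields and exact string matching. The real evaluation harness may normalise values and handle nested schemas, so treat this as an illustration rather than the benchmark code.

```python
def json_f1(predicted: dict, ground_truth: dict) -> dict:
    """Field-level precision/recall/F1 for flat JSON records."""
    # A field counts as correct only if the key exists and the value matches exactly.
    correct = sum(
        1 for key, value in ground_truth.items()
        if key in predicted and predicted[key] == value
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


ground_truth = {"invoice_number": "INV-1042", "total": "1,280.00", "currency": "USD"}
predicted = {"invoice_number": "INV-1042", "total": "1,230.00"}  # one OCR digit error, one missed field

print(json_f1(predicted, ground_truth))  # precision 0.50, recall 0.33, f1 0.40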
Together, these metrics answer the essential question: "Can downstream systems use this output?" rather than simply "Is the text similar?"
Stage 1: Document Reading Ability (OCR and Structural Preservation)
Each parsing model generates Markdown/HTML output. We evaluate using TEDS to measure how well structure is preserved: reading order, table integrity, and layout coherence. You can find our updated dataset published here.
We use the public OCRBench v2 and OmniDocBench datasets. However, upon review, we identified inconsistencies in the published ground truth of OCRBench v2. We conducted a comprehensive audit and correction to ensure evaluation accuracy.
Stage 2: Structured Extraction Accuracy (Downstream Usability)
We pass the Markdown through a standardized LLM (GPT-4o) with predefined JSON schemas, measuring JSON F1. This isolates how OCR quality impacts real extraction workflows, where an LLM interprets the parsed text.
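For reference, this is roughly what the standardized extraction step looks like, sketched with the OpenAI Python SDK; the schema, prompt, and helper name below are illustrative placeholders, not the benchmark's actual schemas.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative schema hint; the benchmark uses predefined, human-audited schemas.
SCHEMA_HINT = {"invoice_number": "string", "invoice_date": "YYYY-MM-DD", "total": "string"}


def extract_fields(markdown: str) -> dict:
    """Ask GPT-4o to fill the schema from parsed Markdown and return JSON."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep the extraction step deterministic across OCR outputs
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract the requested fields from the document. "
                        "Return only JSON matching this schema: " + json.dumps(SCHEMA_HINT)},
            {"role": "user", "content": markdown},
        ],
    )
    return json.loads(response.choices[0].message.content)
```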
Initial JSON schemas and reference answers are generated using Gemini Pro 2.5, then human reviewers audit and correct them to ensure high-quality gold standards.
This methodology ensures fair, reproducible comparisons by varying only the OCR models (Stage 1) while keeping the extraction model constant (Stage 2).
The Results: Public Dataset Performance
Document Parsing Performance
We evaluated leading open-source and proprietary models:
Key Findings:
- Tensorlake achieves the highest TEDS score, indicating superior structural preservation
- The gap between Docling and production-grade systems is substantial
Table Parsing Performance
We evaluated Tensorlake’s table parsing accuracy using the OmniDocBench dataset — a CVPR-accepted benchmark for comprehensive document understanding tasks (GitHub link).
Table accuracy in OmniDocBench is quantified using a combination of tree-based and string-based metrics. In particular, we measured TEDS (Tree Edit Distance Similarity), which assesses both the structural and textual alignment between predicted and ground-truth HTML tables.
To reproduce our results, generate Markdown outputs using the models listed below, then run the evaluation method provided in the OmniDocBench repository. We used 512 document images containing tables and version 1.5 of the evaluation code. Evaluation outputs are released on Hugging Face (link).
¹ Marker's number is taken from the officially published OmniDocBench repository.
Key Findings:
- On OmniDocBench's challenging tables, Tensorlake leads with 86.79% TEDS
- Open-source solutions struggle with table extraction (sub-70% TEDS)
- Tensorlake maintains table structure even on complex, multi-page tables
Performance on Real World Enterprise Documents
OCR models are rarely trained on enterprise documents, because such documents are not publicly available. We wanted to test how well our model and others perform on them.
Enterprise Document Performance (100 pages)
We curated 100 document pages spanning banking, retail, and insurance sectors. This represents real production workloads: invoices with water damage, scanned contracts with skewed text, bank statements with multi-level tables.
Key Findings:
- Tensorlake achieves 91.7% F1 with standard extraction, beating all competitors
- The difference between 91.7% and 68.9% F1 is massive: roughly 5 extra fields correctly extracted out of every 20
- In production workflows processing thousands of documents daily, this accuracy gap compounds into significant error reduction
Even on a standard form, where F1 scores are closer, Azure and Textract jumble the reading order and skip data entirely, whereas Tensorlake preserves the complex reading order and groups data correctly and accurately:
Delivering the Best Performance/Price Ratio
Accuracy without affordability isn't practical. Here's how Tensorlake compares to other Document Ingestion APIs:
- Tensorlake: $10 per 1k pages
  - TEDS Score: 86.79
  - F1 Score: 91.7
- Azure: $10 per 1k pages
  - TEDS Score: 78.14
  - F1 Score: 88.1
- AWS Textract: $15 per 1k pages
  - TEDS Score: 80.75
  - F1 Score: 88.4
Tensorlake delivers higher accuracy than both Azure and AWS Textract, matching Azure's cost while AWS Textract is 50% more expensive. At one million pages, that works out to $10,000 with Tensorlake or Azure versus $15,000 with Textract.
Take the Next Step
When your business depends on accurate document processing, you can't afford to settle for less accurate parsing.
Want to discuss your specific use case?
- Schedule a technical demo with our team.
Questions about the benchmark?