Document parsing is the foundation of enterprise AI applications. Whether you're building RAG pipelines, automating insurance claims, or extracting data from financial reports, everything starts with one question: Can you consistently transform messy, real-world documents into structured, machine-readable data?
Our customers need the best document ingestion API for their use cases. They're comparing Azure, AWS Textract, and popular open-source models like Docling and Marker.
We built a benchmark that measures what matters: Can downstream systems actually use this output?
Measuring What Actually Matters
Tensorlake both reads documents and extracts structured data, so our benchmark needed to measure both sides: document parsing with structural preservation, and structured extraction for downstream usability.
The aspects of Document Parsing that we wanted to measure were:
- Tables: Parsing accuracy on complex tables with merged cells and multi-row headers
- Reading Order: Whether reading order is preserved when parsing multi-column documents and documents with complex layouts
- Structured Extraction Accuracy: Direct downstream usability of extracted data. A single OCR error in a table cell can break a downstream task even when overall OCR accuracy on the document is high
- Non-Textual Content: Extraction of footnotes, formulas, figures, and other non-textual content
Our Evaluation Methodology
We employ two metrics that capture these properties and reflect real-world reliability:
TEDS (Tree Edit Distance Similarity)
- Compares predicted and ground-truth Markdown/HTML tree structures
- Captures structural fidelity in tables and complex layouts
- Widely adopted in OCRBench v2 and OmniDocBench evaluations
- Measures whether the document's logical structure and textual alignment remain intact
TEDS answers: "Is this table still a table?" Not just "Is the text similar?"
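To make this concrete, here is a minimal TEDS sketch in Python. It is a simplification of the official implementation used by OCRBench v2 and OmniDocBench, which also weights node substitution costs by cell-text similarity; the `zss`/`lxml` approach and the sample HTML below are our own illustrative choices, not the benchmark code.

```python
# pip install zss lxml
# Simplified TEDS: 1 - TreeEditDistance / max(|T_pred|, |T_gt|).
from lxml import html as lxml_html
from zss import Node, simple_distance


def html_to_tree(fragment: str) -> Node:
    """Convert an HTML table fragment into a zss tree.

    Node labels combine the tag with normalised cell text, so both
    structure and content contribute to the edit distance.
    """
    def build(element) -> Node:
        node = Node(f"{element.tag}|{(element.text or '').strip()}")
        for child in element:
            node.addkid(build(child))
        return node

    return build(lxml_html.fromstring(fragment))


def tree_size(node: Node) -> int:
    return 1 + sum(tree_size(c) for c in Node.get_children(node))


def teds(pred_html: str, gt_html: str) -> float:
    pred, gt = html_to_tree(pred_html), html_to_tree(gt_html)
    distance = simple_distance(pred, gt)  # Zhang-Shasha tree edit distance
    return 1.0 - distance / max(tree_size(pred), tree_size(gt))


gt = "<table><tr><td>Revenue</td><td>120</td></tr></table>"
pred = "<table><tr><td>Revenue</td><td>12O</td></tr></table>"  # one mis-read digit
print(f"TEDS: {teds(pred, gt):.2f}")  # below 1.0: the mis-read cell lowers the score
```

A dropped row, a merged cell split apart, or a column shifted out of place all show up as extra tree edits, which is exactly the kind of damage a plain text-similarity score misses.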
JSON F1 (Field-Level Precision and Recall)
- Compares extracted JSON against schema-based ground truth
- Precision measures correctness of extracted fields
- Recall measures completeness of required field capture
- F1 score balances both for overall reliability assessment
JSON F1 answers: "Can downstream automation actually use this data?" Not just "Is some text present?"
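As a sketch of how the field-level scoring works, the snippet below compares a predicted record against a ground-truth record, assuming flat key/value fields and exact string matching. The real evaluation harness may normalise values and handle nested schemas, so treat this as an illustration rather than the benchmark code.

```python
def json_f1(predicted: dict, ground_truth: dict) -> dict:
    """Field-level precision/recall/F1 for flat JSON records."""
    # A field counts as correct only if the key exists and the value matches exactly.
    correct = sum(
        1 for key, value in ground_truth.items()
        if key in predicted and predicted[key] == value
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


ground_truth = {"invoice_number": "INV-1042", "total": "1,280.00", "currency": "USD"}
predicted = {"invoice_number": "INV-1042", "total": "1,230.00"}  # one OCR digit error, one missed field

print(json_f1(predicted, ground_truth))  # precision 0.50, recall 0.33, f1 0.40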
Together, these metrics answer the essential question: "Can downstream systems use this output?" rather than simply "Is the text similar?"
Stage 1: Document Reading Ability (OCR and Structural Preservation)
Each parsing model generates Markdown/HTML output. We evaluate using TEDS to measure how well structure is preserved: reading order, table integrity, and layout coherence. You can find our updated dataset published here.
We use the public OCRBench v2 and OmniDocBench datasets. However, upon review, we identified inconsistencies in the published ground truth of OCRBench v2. We conducted a comprehensive audit and correction to ensure evaluation accuracy.
Stage 2: Structured Extraction Accuracy (Downstream Usability)
We pass the Markdown through a standardized LLM (GPT-4o) with predefined JSON schemas, measuring JSON F1. This isolates how OCR quality impacts real extraction workflows, where an LLM interprets the parsed text.
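For reference, this is roughly what the standardized extraction step looks like, sketched with the OpenAI Python SDK; the schema, prompt, and helper name below are illustrative placeholders, not the benchmark's actual schemas.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative schema hint; the benchmark uses predefined, human-audited schemas.
SCHEMA_HINT = {"invoice_number": "string", "invoice_date": "YYYY-MM-DD", "total": "string"}


def extract_fields(markdown: str) -> dict:
    """Ask GPT-4o to fill the schema from parsed Markdown and return JSON."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep the extraction step deterministic across OCR outputs
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract the requested fields from the document. "
                        "Return only JSON matching this schema: " + json.dumps(SCHEMA_HINT)},
            {"role": "user", "content": markdown},
        ],
    )
    return json.loads(response.choices[0].message.content)
```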
Initial JSON schemas and reference answers are generated using Gemini Pro 2.5, then human reviewers audit and correct them to ensure high-quality gold standards.
This methodology ensures fair, reproducible comparisons by varying only the OCR models (Stage 1) while keeping the extraction model constant (Stage 2).
The Results: Public Dataset Performance
Document Parsing Performance
We evaluated leading open-source and proprietary models:
Key Findings:
- Tensorlake achieves the highest TEDS score, indicating superior structural preservation
- The gap between Docling and production-grade systems is substantial
Table Parsing Performance
We evaluated Tensorlake’s table parsing accuracy using the OmniDocBench dataset — a CVPR-accepted benchmark for comprehensive document understanding tasks (GitHub link).
Table accuracy in OmniDocBench is quantified using a combination of tree-based and string-based metrics. In particular, we measured TEDS (Tree Edit Distance Similarity), which assesses both the structural and textual alignment between predicted and ground-truth HTML tables.
To reproduce our results, generate Markdown outputs using the models listed below, then run the evaluation method provided in the OmniDocBench repository. We used 512 document images containing tables and version 1.5 of the evaluation code. Evaluation outputs are released on Hugging Face (link).
¹ Marker's number is taken from the officially published OmniDocBench repository.
Key Findings:
- On OmniDocBench's challenging tables, Tensorlake leads with 86.79% TEDS
- Open-source solutions struggle with table extraction (sub-70% TEDS)
- Tensorlake maintains table structure even on complex, multi-page tables
Performance on Real World Enterprise Documents
OCR models are rarely trained on enterprise documents, because such documents are not publicly available. We wanted to test how well our model and others perform on them.
Enterprise Document Performance (100 pages)
We curated 100 document pages spanning banking, retail, and insurance sectors. This represents real production workloads: invoices with water damage, scanned contracts with skewed text, bank statements with multi-level tables.
Key Findings:
- Tensorlake achieves 91.7% F1 with standard extraction, beating all competitors
- The difference between 91.7% and 68.9% F1 is massive: roughly 5 extra fields correctly extracted out of every 20
- In production workflows processing thousands of documents daily, this accuracy gap compounds into significant error reduction
Even on a standard form, where F1 scores are closer, Azure and Textract jumble the reading order and skip data entirely, whereas Tensorlake preserves the complex reading order and groups data correctly and accurately:
Delivering the Best Performance/Price Ratio
Accuracy without affordability isn't practical. Here's how Tensorlake compares to other Document Ingestion APIs:
- Tensorlake: $10 per 1k pages
  - TEDS Score: 86.79
  - F1 Score: 91.7
- Azure: $10 per 1k pages
  - TEDS Score: 78.14
  - F1 Score: 88.1
- AWS Textract: $15 per 1k pages
  - TEDS Score: 80.75
  - F1 Score: 88.4
Tensorlake delivers higher accuracy than both Azure and AWS Textract, matching Azure's cost while AWS Textract is 50% more expensive. At one million pages, that works out to $10,000 with Tensorlake or Azure versus $15,000 with Textract.
Take the Next Step
When your business depends on accurate document processing, you can't afford to settle for less accurate parsing.
Want to discuss your specific use case?
- Schedule a technical demo with our team.
Questions about the benchmark?