Your invoicing system needs to ingest scanned purchase orders. Your accounting platform handles contracts with cross-page tables. The text inside these PDFs has to come out as structured data, not just a wall of text, or your downstream code has nothing to act on.
In April 2026, LlamaIndex published its ParseBench benchmark, which showed vision LLMs with layout-aware prompts outperforming traditional OCR on layout-heavy documents. The buzz suggests we should all switch to Gemini 3 Flash or GPT-4o with HTML colspan/rowspan prompts. So I ran the comparison live on a messy 2-page purchase order. The results were not what the headlines suggest.
Want to test it on your own documents? Try the OCR Wizard API with a scanned PDF.
Quick comparison
Same 2-page purchase order, 7 line items, repeated shipping-address sub-headers, item 030 split across the page break. Mat.No identifiers (like ALRD00882) are the codes that matter: get one wrong and you ship the wrong product.
| Approach | Latency | Cost | Codes accurate | Layout |
|---|---|---|---|---|
| OCR API alone | 1.14s | ~$0.001 | 7 of 7 | lost |
| GPT-4o-mini + prompts | 22s | $0.0087 | 1 of 7 | preserved |
| GPT-4o full + prompts | 20s | $0.0228 | 1 of 7 | preserved |
| Hybrid (OCR + GPT-4o-mini) | 23s | $0.002 | 7 of 7 | preserved |
What ParseBench got right
The benchmark tested 14 parsing methods and found prompt design matters more than model size. LlamaParse Agentic scored 84.9, Gemini 3 Flash 71, beating dedicated parsers like AWS Textract (47.9), Google DocAI (50.4), and Azure Document Intelligence (59.6).
The trick: ask the model to emit HTML tables with colspan and rowspan attributes. Here is the approach as runnable code:
```python
import base64
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

SYSTEM_PROMPT = """You are a document parser. Convert PDFs into clean Markdown.
- Convert tables to HTML using <table>, <tr>, <th>, <td>.
- Use colspan and rowspan to preserve merged cells and hierarchical headers.
- Maintain reading order. Output only the parsed content."""


def encode(path):
    # Read a page image and return it base64-encoded for use in a data URL.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()


# Send both page images in one request so the model can merge the split table.
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": "Parse this document. Merge tables split across pages."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encode('page1.png')}"}},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encode('page2.png')}"}},
        ]},
    ],
)
print(resp.choices[0].message.content)
```
On my test, both GPT-4o-mini and GPT-4o full produced a correctly structured table. The layout claim holds up.
What ParseBench did not stress-test
Per-character fidelity on identifiers. Both vision LLM runs invented Mat.No codes that look plausible but do not match the source:
| Source | GPT-4o-mini | GPT-4o full |
|---|---|---|
| ALRD00882 | ALU000892 | ALUM0088 |
| ALRD00913 | ALU000913 | ALUM00913 |
| ALSQ00716 | ALU050716 | (dropped) |
| ALPL00534 | ALPL005034 | ALPL05034 |
GPT-4o-mini also rewrote 12.700 (a tolerance in mm) as 12,700, three orders of magnitude off. It misread 3658 mm as 356 mm. GPT-4o full fixed those numeric mistakes but still hallucinated the identifiers.
This is not a flaw in the prompts. It is what happens when a language model generates text from pixels: alphanumeric codes have no linguistic regularity, so the model substitutes characters from codes it has seen in similar layouts. Bigger models hallucinate less, but they still hallucinate.
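To make "codes accurate" concrete, here is a minimal sketch of how I would score it: an identifier counts only if it matches the source character for character. The rows are the four codes from the table above; the function names `exact_matches` and `similarity` are my own, and the similarity ratio is only there to show how close a wrong code can look.

```python
from difflib import SequenceMatcher

# Codes from the table above: (source, GPT-4o-mini, GPT-4o full).
# None marks a code the model dropped entirely.
ROWS = [
    ("ALRD00882", "ALU000892", "ALUM0088"),
    ("ALRD00913", "ALU000913", "ALUM00913"),
    ("ALSQ00716", "ALU050716", None),
    ("ALPL00534", "ALPL005034", "ALPL05034"),
]


def exact_matches(expected, extracted):
    """Count identifiers reproduced character for character."""
    return sum(1 for want, got in zip(expected, extracted) if got == want)


def similarity(want, got):
    """Rough 0..1 closeness score; a dropped code scores 0."""
    return SequenceMatcher(None, want, got).ratio() if got else 0.0


source = [row[0] for row in ROWS]
mini = [row[1] for row in ROWS]
full = [row[2] for row in ROWS]

print(exact_matches(source, mini), "of", len(source))  # 0 of 4
print(exact_matches(source, full), "of", len(source))  # 0 of 4
for want, got in zip(source, mini):
    print(want, got, round(similarity(want, got), 2))
```

A near miss is still a miss here: a code that is one character off ships the wrong product just as surely as one that is completely wrong.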
See the full item-by-item output comparison in the complete guide.
The hybrid pipeline
Pure OCR reads every character literally with no language prior, which is why it preserved all 7 codes. But it emits text in a broken reading order on messy layouts. Hybrid splits the work: OCR for fidelity, LLM for layout reconstruction.
Step 1, OCR extracts exact text:
```python
import requests


def ocr_pdf(pdf_path):
    # Upload the PDF and return the text of all pages, joined by blank lines.
    with open(pdf_path, "rb") as f:
        r = requests.post(
            "https://ocr-wizard.p.rapidapi.com/ocr-pdf",
            headers={"x-rapidapi-key": "YOUR_KEY", "x-rapidapi-host": "ocr-wizard.p.rapidapi.com"},
            files={"pdf_file": f},
            data={"first_page": 1, "last_page": 10},
        )
    r.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
    pages = r.json()["body"]["pages"]
    return "\n\n".join(p["fullText"] for p in pages)
```
Step 2, the LLM reconstructs structure under a prompt that forbids changing values:
```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_KEY")

SYSTEM_PROMPT = """You receive raw OCR text. The OCR is accurate at the character
level but the reading order is broken. Reconstruct the document as clean HTML.
CRITICAL: Every code, number, identifier, email, and date in your output MUST
appear verbatim in the input. Do NOT invent, modify, or correct any value.
Convert tables to HTML with <table>, <tr>, <th>, <td>, colspan and rowspan.
Merge tables split across pages."""


def reconstruct(ocr_text):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"OCR TEXT:\n{ocr_text}\n\nOutput ONLY the HTML."},
        ],
    )
    return resp.choices[0].message.content


# Full pipeline
text = ocr_pdf("purchase_order.pdf")
html = reconstruct(text)
```
On the same purchase order, this preserved all 7 Mat.No codes, fixed the page-break fragmentation, separated the shipping-address blocks, and produced one well-formed HTML table.
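A prompt that forbids changing values is a request, not a guarantee, so it may be worth checking the rule in code. The sketch below is not part of the pipeline I tested: `unverified_values` is a hypothetical helper, and its definition of a "value" (any token containing a digit) is my assumption. It flags values in the HTML output that never occur as a whole token in the OCR text.

```python
import re
from html import unescape

# Assumption: a "value" is a run of letters/digits, optionally joined by . , / -
# (codes, quantities, dates), that contains at least one digit.
VALUE_RE = re.compile(r"[A-Za-z0-9]+(?:[.,/-][A-Za-z0-9]+)*")
TAG_RE = re.compile(r"<[^>]+>")


def unverified_values(html_out, ocr_text):
    """Return values in the HTML output that are absent from the OCR text."""
    # Strip tags first so attributes such as colspan="2" are not checked.
    visible = unescape(TAG_RE.sub(" ", html_out))
    ocr_tokens = set(VALUE_RE.findall(ocr_text))
    return [
        token
        for token in VALUE_RE.findall(visible)
        if any(ch.isdigit() for ch in token) and token not in ocr_tokens
    ]


ocr = "Mat.No ALRD00882 Qty 12.700 Length 3658 mm"
good = '<table><tr><td colspan="2">ALRD00882</td><td>12.700</td><td>3658</td></tr></table>'
bad = "<table><tr><td>ALU000892</td><td>12,700</td><td>356</td></tr></table>"

print(unverified_values(good, ocr))  # []
print(unverified_values(bad, ocr))   # ['ALU000892', '12,700', '356']
```

Matching whole tokens instead of substrings is deliberate: a substring check would accept 356 because it occurs inside 3658. The trade-off is false positives when the LLM splits or joins tokens differently from the OCR, so flagged values are candidates for review, not proof of an error.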
Why hybrid costs less than direct vision LLM
Vision LLM input is dominated by image tokens. Two pages plus prompts run about 51,000 tokens. The hybrid sends only the OCR text, about 1,300 tokens, so input tokens drop by a factor of about 39. At 10,000 documents per month, that works out to $20 for hybrid, $87 for GPT-4o-mini direct, and $228 for GPT-4o full.
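The arithmetic behind those figures is simple enough to rerun with your own volume. This sketch uses only the per-document costs and token counts reported in this post; the helper name `monthly_cost` is mine, and your numbers will differ with page count and pricing.

```python
# Per-document costs measured on the 2-page purchase order in this post.
COST_PER_DOC = {
    "Hybrid (OCR + GPT-4o-mini)": 0.002,
    "GPT-4o-mini direct": 0.0087,
    "GPT-4o full direct": 0.0228,
}

# Approximate input sizes reported above.
VISION_INPUT_TOKENS = 51_000
HYBRID_INPUT_TOKENS = 1_300


def monthly_cost(cost_per_doc, docs_per_month):
    """Scale a per-document cost to a monthly volume."""
    return cost_per_doc * docs_per_month


for name, cost in COST_PER_DOC.items():
    print(f"{name}: ${monthly_cost(cost, 10_000):,.0f}")

ratio = VISION_INPUT_TOKENS / HYBRID_INPUT_TOKENS
print(f"Input token ratio: about {ratio:.0f}x")
```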
When to use what
- Searchable text only (RAG, archive): OCR alone.
- Structured tables, values must be exact (invoices, contracts): hybrid.
- Charts, graphs, signatures, hand-drawn marks: vision LLM direct, since OCR cannot see what is not text.
- Sub-second latency at high volume: OCR alone.
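If you route documents automatically, the list above can be written as a small decision function. This is only a sketch: `choose_pipeline` and its flags are hypothetical names, and the order in which the rules are checked is my assumption for cases where more than one applies.

```python
def choose_pipeline(needs_structure, has_non_text_content=False,
                    needs_subsecond_latency=False):
    """Pick an approach following the guidance in the list above."""
    # Charts, signatures, and hand-drawn marks are invisible to OCR.
    if has_non_text_content:
        return "vision_llm"
    # Searchable text only, or a tight latency budget: skip the LLM step.
    if needs_subsecond_latency or not needs_structure:
        return "ocr"
    # Structured tables whose values must be exact.
    return "hybrid"


print(choose_pipeline(needs_structure=False))                             # ocr
print(choose_pipeline(needs_structure=True))                              # hybrid
print(choose_pipeline(needs_structure=True, has_non_text_content=True))   # vision_llm
```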
Sources
- LlamaIndex ParseBench
- Umair Ali Khan, "How to Accurately Extract Everything from Documents Using AI"
Read the full guide with the annotated test document and complete pipeline code on ai-engine.net.