DEV Community

AI Engine

Posted on • Originally published at ai-engine.net

Extract PDF Tables in 2026: Hybrid OCR + LLM Beats GPT-4o Vision

Your invoicing system needs to ingest scanned purchase orders. Your accounting platform handles contracts with cross-page tables. The text inside these PDFs has to come out as structured data, not just a wall of text, or your downstream code has nothing to act on.

In April 2026, LlamaIndex published their ParseBench benchmark showing vision LLMs with specific prompts outperform traditional OCR on layout-heavy documents. The buzz suggests we should all switch to Gemini 3 Flash or GPT-4o with HTML colspan/rowspan prompts. So I ran the comparison live on a messy 2-page purchase order. The results were not what the headlines suggest.

Want to test it on your own documents? Try the OCR Wizard API with a scanned PDF.

Quick comparison

All four approaches ran on the same 2-page purchase order: 7 line items, repeated shipping-address sub-headers, and item 030 split across the page break. The Mat.No identifiers (like ALRD00882) are the codes that matter: get one wrong and you ship the wrong product.

Approach                      Latency   Cost per doc   Codes accurate   Layout
OCR API alone                 1.14s     ~$0.001        7 of 7           lost
GPT-4o-mini + prompts         22s       $0.0087        1 of 7           preserved
GPT-4o full + prompts         20s       $0.0228        1 of 7           preserved
Hybrid (OCR + GPT-4o-mini)    23s       $0.002         7 of 7           preserved

What ParseBench got right

The benchmark tested 14 parsing methods and found prompt design matters more than model size. LlamaParse Agentic scored 84.9, Gemini 3 Flash 71, beating dedicated parsers like AWS Textract (47.9), Google DocAI (50.4), and Azure Document Intelligence (59.6).

The trick: ask the model to emit HTML tables with colspan and rowspan attributes. Here is the approach as runnable code:

import base64
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

SYSTEM_PROMPT = """You are a document parser. Convert PDFs into clean Markdown.
- Convert tables to HTML using <table>, <tr>, <th>, <td>.
- Use colspan and rowspan to preserve merged cells and hierarchical headers.
- Maintain reading order. Output only the parsed content."""

def encode(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": "Parse this document. Merge tables split across pages."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encode('page1.png')}"}},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encode('page2.png')}"}},
        ]},
    ],
)
print(resp.choices[0].message.content)

On my test, both GPT-4o-mini and GPT-4o full produced a correctly structured table. The layout claim holds up.

What ParseBench did not stress-test

Per-character fidelity on identifiers. Both vision LLM runs invented Mat.No codes that look plausible but do not match the source:

Source       GPT-4o-mini   GPT-4o full
ALRD00882    ALU000892     ALUM0088
ALRD00913    ALU000913     ALUM00913
ALSQ00716    ALU050716     (dropped)
ALPL00534    ALPL005034    ALPL05034

GPT-4o-mini also rewrote 12.700 (a tolerance in mm) as 12,700, three orders of magnitude off. It misread 3658 mm as 356 mm. GPT-4o full fixed those numeric mistakes but still hallucinated the identifiers.

This is not a flaw in the prompts. It is what happens when a language model generates text from pixels: alphanumeric codes have no linguistic regularity, so the model substitutes characters from codes it has seen in similar layouts. Bigger models hallucinate less, but they still hallucinate.
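
To make the "codes accurate" count concrete, here is a minimal sketch of how such a score could be computed. It uses only the four source codes and GPT-4o-mini readings listed in the table above. The function names are illustrative and are not taken from the original test setup. The similarity ratio from Python's difflib is included only to show that a wrong code can still partially overlap the real one, which is why exact matching is the check that counts.

```python
from difflib import SequenceMatcher

# Source codes and GPT-4o-mini readings, copied from the table above.
SOURCE = ["ALRD00882", "ALRD00913", "ALSQ00716", "ALPL00534"]
GPT_4O_MINI = ["ALU000892", "ALU000913", "ALU050716", "ALPL005034"]


def exact_matches(expected, extracted):
    """Count codes that were reproduced character for character."""
    return sum(1 for e, x in zip(expected, extracted) if e == x)


def similarity(expected, extracted):
    """Return a ratio between 0 and 1 describing how much two codes overlap."""
    return SequenceMatcher(None, expected, extracted).ratio()


if __name__ == "__main__":
    print(exact_matches(SOURCE, GPT_4O_MINI), "of", len(SOURCE), "exact")
    for src, got in zip(SOURCE, GPT_4O_MINI):
        print(f"{src} -> {got}  similarity={similarity(src, got):.2f}")
```

A code that is mostly right is still the wrong product, so a pipeline like this should gate on the exact-match count and treat the similarity ratio as a debugging aid at most.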

See the full item-by-item output comparison in the complete guide.

The hybrid pipeline

Pure OCR reads every character literally with no language prior, which is why it preserved all 7 codes. But it emits text in a broken reading order on messy layouts. Hybrid splits the work: OCR for fidelity, LLM for layout reconstruction.

Step 1, OCR extracts exact text:

import requests

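def ocr_pdf(pdf_path):
    """Send a PDF to the OCR endpoint and return the text of all pages."""
    with open(pdf_path, "rb") as f:
        r = requests.post(
            "https://ocr-wizard.p.rapidapi.com/ocr-pdf",
            headers={"x-rapidapi-key": "YOUR_KEY", "x-rapidapi-host": "ocr-wizard.p.rapidapi.com"},
            files={"pdf_file": f},
            data={"first_page": 1, "last_page": 10},
            timeout=60,  # avoid hanging forever on a stalled request
        )
    r.raise_for_status()  # surface HTTP errors before touching the JSON body
    pages = r.json()["body"]["pages"]
    # Join the per-page text, separating pages with a blank line
    return "\n\n".join(p["fullText"] for p in pages)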

Step 2, the LLM reconstructs structure under a prompt that forbids changing values:

from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_KEY")

SYSTEM_PROMPT = """You receive raw OCR text. The OCR is accurate at the character
level but the reading order is broken. Reconstruct the document as clean HTML.
CRITICAL: Every code, number, identifier, email, and date in your output MUST
appear verbatim in the input. Do NOT invent, modify, or correct any value.
Convert tables to HTML with <table>, <tr>, <th>, <td>, colspan and rowspan.
Merge tables split across pages."""

def reconstruct(ocr_text):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"OCR TEXT:\n{ocr_text}\n\nOutput ONLY the HTML."},
        ],
    )
    return resp.choices[0].message.content

# Full pipeline
text = ocr_pdf("purchase_order.pdf")
html = reconstruct(text)

On the same purchase order, this preserved all 7 Mat.No codes, fixed the page-break fragmentation, separated the shipping-address blocks, and produced one well-formed HTML table.
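
The prompt asks the model not to alter values, but a prompt is not a guarantee. One way to enforce the rule is a post-check that every code-like token in the HTML also appears verbatim in the OCR text. The sketch below is an assumption on my part, not part of the tested pipeline: the helper name `unverified_tokens` and the token pattern (5 or more letters and digits with at least one digit) are hypothetical and should be tuned to your own identifiers. It does not cover decimal values such as 12.700.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Assumed identifier shape: 5+ letters/digits containing at least one digit.
TOKEN_RE = re.compile(r"\b(?=[A-Za-z0-9]*\d)[A-Za-z0-9]{5,}\b")


def unverified_tokens(ocr_text, html):
    """Return code-like tokens in the HTML that never appear in the OCR text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    visible = " ".join(parser.chunks)
    ocr_tokens = set(TOKEN_RE.findall(ocr_text))
    return sorted({t for t in TOKEN_RE.findall(visible) if t not in ocr_tokens})
```

Calling `unverified_tokens(text, html)` after `reconstruct(text)` and rejecting or retrying any document with a non-empty result turns the "verbatim" instruction into something you can test.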

Why hybrid costs less than direct vision LLM

Vision LLM input is dominated by image tokens. Two pages plus prompts run about 51,000 tokens. The hybrid sends only the OCR text, about 1,300 tokens, so the LLM input shrinks by a factor of about 39 (51,000 / 1,300). At 10,000 documents per month, the per-document costs from the comparison table work out to $20 for hybrid, $87 for GPT-4o-mini direct, and $228 for GPT-4o full.
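
The arithmetic behind those numbers can be reproduced in a few lines. This is only a worked check using the token counts and per-document costs quoted in this article. The variable and function names are mine, and your own figures will depend on page count, image resolution, and current pricing.

```python
# Figures quoted in this article for the 2-page purchase order.
VISION_INPUT_TOKENS = 51_000
HYBRID_INPUT_TOKENS = 1_300
COST_PER_DOC = {
    "Hybrid (OCR + GPT-4o-mini)": 0.002,
    "GPT-4o-mini direct": 0.0087,
    "GPT-4o full direct": 0.0228,
}
DOCS_PER_MONTH = 10_000


def input_token_ratio(vision_tokens, hybrid_tokens):
    """How many times more input tokens the vision route sends."""
    return vision_tokens / hybrid_tokens


def monthly_cost(cost_per_doc, docs_per_month):
    """Monthly spend in dollars at a fixed per-document cost."""
    return cost_per_doc * docs_per_month


if __name__ == "__main__":
    ratio = input_token_ratio(VISION_INPUT_TOKENS, HYBRID_INPUT_TOKENS)
    print(f"Input token ratio: about {ratio:.0f}x")
    for name, cost in COST_PER_DOC.items():
        print(f"{name}: ${monthly_cost(cost, DOCS_PER_MONTH):.0f} per month")
```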

When to use what

  • Searchable text only (RAG, archive): OCR alone.
  • Structured tables, values must be exact (invoices, contracts): hybrid.
  • Charts, graphs, signatures, hand-drawn marks: vision LLM direct, since OCR cannot see what is not text.
  • Sub-second latency at high volume: OCR alone.
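
If you route documents automatically, the list above can be sketched as a small decision function. The names and the priority order are assumptions for illustration: the article does not say what to do when requirements conflict, so this sketch puts non-text content first and latency second, on the reasoning that the hybrid run took 23 seconds in the test.

```python
from enum import Enum


class Approach(Enum):
    OCR_ONLY = "OCR alone"
    HYBRID = "hybrid OCR + LLM"
    VISION_LLM = "vision LLM direct"


def choose_approach(has_non_text_content, needs_exact_tables, needs_subsecond_latency=False):
    """Pick a parsing route from three yes/no requirements (assumed priority order)."""
    if has_non_text_content:
        # Charts, signatures, hand-drawn marks: OCR cannot see what is not text.
        return Approach.VISION_LLM
    if needs_subsecond_latency:
        # Only the OCR call came close to one second in the comparison.
        return Approach.OCR_ONLY
    if needs_exact_tables:
        # Invoices and contracts: exact values plus reconstructed layout.
        return Approach.HYBRID
    # Searchable text for RAG or archiving.
    return Approach.OCR_ONLY
```

For example, `choose_approach(False, True)` selects the hybrid route for a scanned invoice, while `choose_approach(True, False)` sends a page of charts to the vision model.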

Sources

Read the full guide with the annotated test document and complete pipeline code on ai-engine.net.
