zhongqiyue

Posted on Jun 5

I Spent a Month Fighting LLMs to Extract Structured Data

#ai #python #tutorial #discuss

I needed to extract invoice line items from hundreds of PDF documents. Dates, amounts, vendor names. Sounded trivial with AI. But every naive approach burned cash, hallucinated values, or choked on varied formats.

Here’s what I tried, what failed, and the technique that finally worked – no hype, just honest trade-offs.

The Real Problem

We got a stack of PDF invoices from old acquisitions. Different layouts, fonts, some scanned. My boss wanted a structured CSV. “Just use an LLM,” they said.

I started with the simplest thing: dump the PDF text into GPT-4 with a prompt asking for a JSON array. For ~10 documents it worked fine. For 200, I hit three walls:

Cost – each invoice needed ~4K tokens, at $0.03 per, 200 invoices = $6 just for extraction. Not terrible, but scale it to 10K documents and I’m bleeding money.
Inconsistency – sometimes GPT would output a list, sometimes a dictionary, sometimes “I can’t find the data.”
Latency – synchronous API calls per document took 2-3 seconds each. 200 docs ~10 minutes. Fine for a batch, but unacceptable for real-time.

What I Tried (That Didn’t Work)

Pure Regex & Rule-Based

First I thought: “It’s just tables. I’ll parse line by line.” I wrote a monster regex to catch “Total: $X.XX” and “Item: ...”. It worked on 30% of invoices. The rest had variations like “Total Amount: $X” or “TOTAL … $X”. Dead end. I spent two weeks patching rules, and still hit 50% accuracy.

Full Document to LLM, Extract All

Then I switched to LLM-everywhere. Feed the entire PDF text (via PyMuPDF) into GPT-4 with a system prompt: “Extract all line items as JSON array.” This worked better but still had issues:

Hallucinations: created fake line items when the text was ambiguous.
Schema drift: sometimes returning {"items": [...]} other times [{"item": ...}].
Cost: $0.06 per invoice because of long context.

I tried fine-tuning a small model – got a bit cheaper but needed labeled data. For 200 docs, that was hours of manual labeling. Not scalable.

What Eventually Worked: Schema-Guided Extraction with Validation

The breakthrough came when I separated the problem into two phases:

Structure the input – chunk the PDF into logical blocks (e.g., each line item is a block).
Validate the output – force the LLM to output a strict schema, then parse and verify.

Step 1: Chunking by Heuristics

Instead of dumping the whole PDF, I extracted raw text and used simple heuristics to isolate each line item. Invoices often use tabs or multiple spaces to separate description, quantity, unit price, total. A quick regex split on \n(?=\d+\.) or \t+ gave me candidate blocks. Not perfect, but good enough.

Step 2: Schema-Guided Extraction

I defined a Pydantic model for what I wanted:

from pydantic import BaseModel
from typing import Optional

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float
    tax: Optional[float] = None

Then, for each chunk, I called the LLM with a prompt that included the schema as JSON and a few examples of correct extraction (few-shot). The key was using the function calling / tool use API to force structured output. In OpenAI terms:

import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-1106",  # cheaper than 4
    messages=[
        {"role": "system", "content": "Extract line item data as JSON matching the schema."},
        {"role": "user", "content": f"Chunk: {chunk_text}"}
    ],
    functions=[{
        "name": "extract_line_item",
        "parameters": LineItem.schema()
    }],
    function_call={"name": "extract_line_item"}
)

result = response.choices[0].message.function_call.arguments
import json
data = json.loads(result)
# validate with pydantic
line_item = LineItem(**data)

This forced the LLM to always return the exact fields I defined. No more schema drift.

Step 3: Validation & Retry

I added a retry loop: if LineItem(**data) raised a validation error (e.g., quantity is a string), I sent the error back to the LLM with a note: “The field ‘quantity’ must be an integer. Please correct.” This dramatically reduced hallucinations.

for _ in range(2):
    try:
        line_item = LineItem(**data)
        break
    except Exception as e:
        messages.append({"role": "user", "content": f"Validation error: {e}. Please fix."})
        response = openai.ChatCompletion.create(...)
        data = json.loads(response.choices[0].message.function_call.arguments)

Results

Cost: Using gpt-3.5-turbo instead of gpt-4, with small chunks, cost dropped to ~$0.003 per line item. For 200 invoices with ~5 line items each = $3 total.
Accuracy: Near 100% for well-formed invoices, ~85% for scanned ones (still better than regex).
Latency: With parallel processing (async HTTP calls) and chunking, 200 invoices processed in ~2 minutes.

Trade-offs & Limitations

Chunking heuristic: Very fragile. If the invoice uses no clear separators, my heuristic fails. I later swapped to a sliding window approach (overlapping text sections).
LLM still expensive if you have millions: For massive scale, you’d want a dedicated extraction model (e.g., LayoutLM) trained on your data.
Schema rigidity: If the invoice has unexpected fields (e.g., “freight” not in schema), you lose them. I added an optional extra_info: str field as a catch-all.
Dependency on OpenAI function calling: Ties you to that API. You could use local models with JSON mode (like Llama 3.2, Ollama) but then lose speed.

What I’d Do Differently Next Time

Start with a small labeled test set: Even 20 invoices to tune the chunking heuristic before touching LLMs.
Use a validation-first design: Write the Pydantic schema and retry loop first, then pick the LLM later. The technique is model-agnostic.
Consider a tool like Interwest AI (for building such extraction pipelines without coding the retry logic manually) – I discovered it later and it would have saved me a week. But I’m glad I built it myself to understand the pitfalls.

Lessons Learned

The golden rule: Never trust an LLM to output freeform JSON without validation and schema enforcement. Separate the concerns of “understanding” and “formatting.” Chunk the input, force the output, verify, and retry.

This approach works for any document extraction task: contracts, receipts, medical records. The technique is the star, not the model.

I’m still iterating on the chunking part – how do you handle completely unstructured tables? I’d love to hear your tricks. Drop a comment if you’ve solved similar extraction nightmares.

What’s your go-to approach for extracting structured data from messy documents?

DEV Community