zhongqiyue

Posted on Jun 28

From Regex Hell to AI: How I Finally Tamed Messy PDF Invoices

#webdev #python #ai #tutorial

Last month, I spent three days wrestling with 500 PDF invoices. Each one had the same data (vendor name, invoice number, total amount), but the layouts were all over the place: different fonts, missing headers, tables that somehow broke across pages. I tried regex. I tried OCR with layout analysis. I even tried building a rule-based parser that looked for keywords like "Total:".

Nothing worked reliably. Every time I fixed one pattern, another invoice broke. I was one commit away from throwing my laptop out the window.

Then I took a step back. I realized I didn't need to understand every layout variation. I just needed to understand the data. And that's where AI came in.

What didn’t work

Let me be clear: I tried the usual suspects first.

Regex. Classic. I wrote patterns like r"Total\s*:\s*\$?(\d+\.\d{2})". Worked on 60% of invoices. The rest had "Total Due" or "Amount Total" or the dollar sign in a different place. Regex is great when you control the input. I didn't.

OCR with layout parsing. I used Tesseract with --psm 6 and tried to extract lines by bounding boxes. It helped a bit, but tables with merged cells or rotated text threw it off. Plus, I had to write code to guess which box was a field name and which was a value.

Rule-based parser. I built a dictionary of known vendors and their layouts. That worked … until I got an invoice from a new vendor. Maintenance became a nightmare.
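
To make the regex failure mode concrete, here is a small sketch that runs the pattern quoted above against a few label variants. The sample lines are made up for illustration; they are not taken from the real invoices.

```python
import re

# The pattern quoted in the post, compiled once.
TOTAL_RE = re.compile(r"Total\s*:\s*\$?(\d+\.\d{2})")

# Hypothetical sample lines covering the variants the post mentions.
samples = [
    "Total: $123.45",        # the layout the pattern was written for
    "Total Due: $123.45",    # extra word between "Total" and the colon
    "Amount Total $123.45",  # no colon at all
    "Total: USD 123.45",     # currency code where the dollar sign was expected
]

def find_total(line):
    """Return the captured amount as a string, or None when the pattern misses."""
    match = TOTAL_RE.search(line)
    return match.group(1) if match else None

results = [find_total(line) for line in samples]
for line, result in zip(samples, results):
    print(f"{line!r:26} -> {result}")
```

Only the first line matches. Each new variant needs another pattern or another optional group, which is the maintenance spiral described above.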

I was solving the wrong problem. Instead of fighting formatting, I needed to focus on meaning.

The AI approach that saved me

I remembered that large language models are surprisingly good at understanding context. If I could give the model the raw text from a PDF and a description of what I wanted, maybe it could extract the fields directly.

Here’s the core idea: treat extraction as a structured generation task. Provide a prompt with a few examples (few-shot) or just describe the schema, and let the model output JSON.

I found an API that did exactly this with a simple HTTP call. (Full disclosure: I used Interwest AI because it had a free tier and a straightforward endpoint.) But the technique works with any LLM that supports function calling or JSON mode—OpenAI, Anthropic, local models, etc.

Step 1: Extract raw text from PDF

I used PyMuPDF (fitz) to grab all text in order.

import fitz  # PyMuPDF

def extract_text(pdf_path):
    """Return the text of every page, concatenated in page order."""
    with fitz.open(pdf_path) as doc:  # the context manager closes the file
        return "".join(page.get_text() for page in doc)

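This gives me one big string with no table structure. That’s fine: the AI will figure it out. One caveat: get_text() only returns text that is embedded in the PDF, so scanned, image-only pages come back empty and need OCR first.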

Step 2: Build the prompt

I defined a clear JSON schema for the output I wanted.

extraction_schema = {
    "vendor_name": "string",
    "invoice_number": "string",
    "invoice_date": "YYYY-MM-DD string",
    "total_amount": "number (float)",
    "currency": "string (e.g. USD, EUR)",
    "line_items": [{"description": "string", "quantity": "number", "unit_price": "number", "amount": "number"}]
}

Then I wrote a system prompt that explains the task and embeds the schema.

import json

system_prompt = f"""
You are a data extraction assistant. Extract the requested fields from the invoice text below.
Output ONLY valid JSON matching this schema:
{json.dumps(extraction_schema, indent=2)}

If a field is missing, use null. For line_items, include all items mentioned in the invoice.
"""
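
The prompt above describes the schema only. For the few-shot variant mentioned earlier, one common pattern is to pass worked examples as earlier user/assistant turns. The sketch below assumes a chat-style message format; `build_messages` and the example invoice are hypothetical and not part of the original pipeline.

```python
import json

def build_messages(system_prompt, examples, invoice_text):
    """Build a chat message list with worked examples placed before the real invoice."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_text, example_fields in examples:
        # Each example is a user turn (raw text) and an assistant turn (expected JSON).
        messages.append({"role": "user", "content": f"Invoice text:\n\n{example_text}"})
        messages.append({"role": "assistant", "content": json.dumps(example_fields)})
    messages.append({"role": "user", "content": f"Invoice text:\n\n{invoice_text}"})
    return messages

# Made-up example pair, for illustration only.
EXAMPLES = [
    (
        "ACME Supplies\nInvoice No. 1001\nDate: 2024-01-15\nTotal Due: $250.00",
        {"vendor_name": "ACME Supplies", "invoice_number": "1001",
         "invoice_date": "2024-01-15", "total_amount": 250.0,
         "currency": "USD", "line_items": []},
    ),
]

messages = build_messages("Extract the invoice fields as JSON.", EXAMPLES,
                          "raw text of a new invoice")
```

Each example adds prompt tokens to every request, so a couple of short, varied examples is usually a better trade than many long ones.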

Step 3: Call the API

Here’s the function that sends the document text and prompt to the AI API.

import requests
import json

API_URL = "https://ai.interwestinfo.com/v1/extract"  # or any compatible endpoint
API_KEY = "your_key_here"

def extract_invoice_data(text):
    payload = {
        "model": "gpt-4o-mini",  # or whatever model the API supports
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Invoice text:\n\n{text}"}
        ],
        "response_format": {"type": "json_object"}
    }
    headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
    resp = requests.post(API_URL, json=payload, headers=headers, timeout=60)  # don't hang forever on a stalled request
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

Step 4: Validate and fall back

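AI outputs aren’t perfect. I added a validation step that parses the response and checks for the critical field. If the JSON is malformed or total_amount is missing, it returns None so the caller can retry with a simpler prompt or mark the invoice for manual review.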

import json

def validate_and_parse(raw_output):
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        print("AI returned invalid JSON. Falling back to retry...")
        return None
    if data.get("total_amount") is None:
        print("Missing critical field. Marking for manual review.")
        return None
    return data
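
The function above only reports failure; the retry itself can live in a small wrapper. This is a minimal sketch, assuming a hypothetical `call_model(prompt, invoice_text)` callable that returns the raw model output as a string. The post's `extract_invoice_data` would need a prompt parameter to fit that shape.

```python
import json

def parse_with_retry(call_model, invoice_text, prompts):
    """Try each prompt in order; return the first usable result, else None."""
    for attempt, prompt in enumerate(prompts, start=1):
        raw = call_model(prompt, invoice_text)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            print(f"Attempt {attempt}: invalid JSON, trying the next prompt")
            continue
        if isinstance(data, dict) and data.get("total_amount") is not None:
            return data
        print(f"Attempt {attempt}: total_amount missing, trying the next prompt")
    return None  # every prompt failed: mark the invoice for manual review

# Stub standing in for the API: fails on the first prompt, succeeds on the second.
def fake_model(prompt, invoice_text):
    if prompt == "detailed":
        return "Sure! Here is the JSON: {"
    return '{"total_amount": 99.5}'

result = parse_with_retry(fake_model, "raw invoice text", ["detailed", "simple"])
```

Passing the model call in as a function keeps the retry logic testable without network access.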

Then I loop through all invoices, collect results, and export to CSV.
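
The export step is not shown in the post, so here is one way it might look with the standard csv module. The column list mirrors the schema above, plus a hypothetical `file` column. Nested line items are stored as a JSON string in a single cell, which is a design choice of this sketch.

```python
import csv
import io
import json

FIELDS = ["file", "vendor_name", "invoice_number", "invoice_date",
          "total_amount", "currency", "line_items"]

def write_results(results, out):
    """Write one CSV row per invoice to the open text stream `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in results:
        flat = dict(row)
        # A CSV cell cannot hold a list, so serialise the line items.
        flat["line_items"] = json.dumps(row.get("line_items") or [])
        writer.writerow(flat)

# Made-up result, for illustration only.
sample = [{
    "file": "inv_001.pdf", "vendor_name": "ACME Supplies", "invoice_number": "1001",
    "invoice_date": "2024-01-15", "total_amount": 250.0, "currency": "USD",
    "line_items": [{"description": "Widget", "quantity": 2,
                    "unit_price": 125.0, "amount": 250.0}],
}]

buffer = io.StringIO()
write_results(sample, buffer)
csv_text = buffer.getvalue()
```

When writing to a real file instead of a StringIO buffer, open it with newline="" as the csv module documentation recommends.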

Results

Out of 500 invoices, the AI correctly extracted all requested fields for 478 (95.6%). The remaining 22 had issues: usually a missing line item or a hallucinated date. I set those aside for manual review. Total time: ~45 minutes of processing (with a 3-second delay per request) plus 30 minutes of manual fixes. Way better than three days of regex.

Lessons learned and trade-offs

This approach isn't magic. Here’s what I wish I’d known:

Cost. Each request costs about $0.01–$0.02 with gpt-4o-mini. For 500 invoices, that’s roughly $5–$10. Fine for a one-off, but think about scale.
Latency. ~2-4 seconds per request. With parallelisation you can speed it up, but hitting API rate limits is real.
Hallucinations. The model sometimes fills in a missing date with a plausible one. Always validate your output against known constraints.
Prompt sensitivity. Small changes in wording can change extraction accuracy. Test prompts on a sample first.
Data privacy. If you're sending sensitive invoice data to an external API, ensure compliance (GDPR, HIPAA, etc.). Consider using a local LLM like Llama 3 if needed.
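
For the hallucination point, validating against known constraints can be as simple as a few cheap checks before accepting a result. The rules below are assumptions chosen for illustration: the invoice number must appear verbatim in the source text, the date must be a real calendar date that is not in the future, and the total must be a non-negative number.

```python
from datetime import date, datetime

def check_extraction(data, source_text, today=None):
    """Return a list of problems; an empty list means every check passed."""
    today = today or date.today()
    problems = []

    # An invented invoice number will not appear in the source text.
    number = data.get("invoice_number")
    if number is not None and str(number) not in source_text:
        problems.append("invoice_number not found in source text")

    # The date must parse as YYYY-MM-DD and must not lie in the future.
    raw_date = data.get("invoice_date")
    if raw_date is not None:
        try:
            parsed = datetime.strptime(raw_date, "%Y-%m-%d").date()
        except (TypeError, ValueError):
            problems.append("invoice_date is not a valid YYYY-MM-DD date")
        else:
            if parsed > today:
                problems.append("invoice_date is in the future")

    # The total must be a real, non-negative number (bool is excluded on purpose).
    total = data.get("total_amount")
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total < 0:
        problems.append("total_amount is not a non-negative number")
    return problems

# Made-up source text and extractions, for illustration only.
source = "ACME Supplies\nInvoice No. 1001\nTotal Due: $250.00"
good = {"invoice_number": "1001", "invoice_date": "2024-01-15", "total_amount": 250.0}
bad = {"invoice_number": "1007", "invoice_date": "2024-02-30", "total_amount": 250.0}

good_problems = check_extraction(good, source, today=date(2024, 6, 28))
bad_problems = check_extraction(bad, source, today=date(2024, 6, 28))
```

These checks cannot prove a date is correct, only that it is plausible, so anything flagged should still go to manual review rather than be corrected automatically.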

When not to use this: If you have thousands of identical PDFs with zero variation, a regex + OCR pipeline is cheaper and faster. If you need real-time extraction (<1 second), this isn’t it. Also, if you don’t have a clear schema or need nested relationships, the AI can get confused.

What I’d do differently next time

Use structured outputs via function calling. Many APIs now let you define the exact JSON schema and the model will obey it strictly, reducing malformed responses.
Batch similar documents together to save on prompt tokens and cost.
Add a classification step first to identify the document type, then route to a specialised extraction prompt.
Cache results so re-processing the same PDF doesn’t cost again.
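
As a sketch of the caching idea, results can be keyed by a hash of the PDF's bytes, so a renamed copy of the same file still hits the cache. `cached_extract` and the one-JSON-file-per-document layout are assumptions of this example, and `extract_fn` stands in for whatever function runs the full extraction.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def file_digest(pdf_path):
    """SHA-256 of the file contents, used as the cache key."""
    return hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()

def cached_extract(pdf_path, extract_fn, cache_dir):
    """Return the cached result for this PDF, calling extract_fn only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{file_digest(pdf_path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = extract_fn(pdf_path)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result

# Demo with a stub extractor that records how often it is called.
calls = []
def fake_extract(path):
    calls.append(path)
    return {"invoice_number": "1001"}

with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "invoice.pdf"
    pdf.write_bytes(b"%PDF-1.4 placeholder bytes")  # not a real PDF, just bytes to hash
    first = cached_extract(pdf, fake_extract, Path(tmp) / "cache")
    second = cached_extract(pdf, fake_extract, Path(tmp) / "cache")
```

The second call is served from disk, so the stub runs only once. If the prompt or schema changes, the cache key would need to include a version string as well, otherwise stale results come back.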

This whole experience changed how I approach extractive tasks. Instead of trying to reverse-engineer every layout, I let the model understand the content. It’s not perfect, but it’s the closest thing to a universal parser I’ve seen.

What messy data problem are you dealing with right now? Have you tried throwing an LLM at it? I’d love to hear your war stories.

Top comments (1)

UnitBuilds • Jun 28

I actually built a pipeline for financial document parsing for Doccit (Autonomous Accounting Suite). What worked well was separating the 'process PDF' problem into its constituent parts, the biggest split being when to use an LLM and when not to. You can get 99% of the way there using standard OCR: you get document layouts, you get 99% accurate text, etc. Then you use the LLM to fix any low-confidence snippets and to semantically understand the nature of the document. You can also use formulaic correction to fix numeric drift, e.g. if the sub-total is OCR'd wrong, you use Total - Tax to derive it, which saves you an LLM call. You can also mosaic the snippets to make it cheaper (1 call per stack of docs instead of 100), and you can use sifts to filter the documents into buckets, so you don't need the LLM to classify entire documents; you let it write the rules that identify the document.