How I finally got LLMs to return reliable structured data

#python #ai #tutorial #api

Let me walk you through a mess I made last month.

I was building a little internal tool to extract key details from customer PDFs — things like invoice numbers, dates, line items, totals. Classic automation stuff. The plan was simple: dump the PDF text into an LLM, ask it nicely for JSON, and boom — structured data ready to insert into our database.

Of course, it didn't work. Not even close.

The problem that wouldn't go away

The first few tests looked promising. I sent a short invoice to GPT-3.5-turbo with a prompt like:

Extract the invoice number, date, total amount, and line items from the following text. Return as JSON.

It returned perfectly formatted JSON, once. Then I tried other invoices and got keys like "Invoice Number" (capitalized), "inv_num" (underscore), and once even "INVOICE_NUMBER" (all caps). The structure of the JSON shifted from run to run: sometimes line items came back as an array, sometimes as an object. My parser broke after three documents.

I tried few-shot prompting with examples. That helped a little: the keys were now consistent. But the values were still formatted inconsistently ("1,234.56" vs "1234.56"), and occasionally the model hallucinated line items that didn't exist.

I spent two days tweaking prompts. Temperature to 0? Check. System prompt with explicit schema? Check. Chain-of-thought? Check. Nothing gave me the reliability I needed for production.

The dead end: prompt engineering alone

Don't get me wrong — prompt engineering is powerful. But for structured data extraction, it's fragile. A slight change in the input text (different phrasing, extra whitespace, a stray table) would cause the model to shift the output format. And since I couldn't control the PDF formatting from our clients, I was stuck.

I even tried using a JSON schema in the prompt and instructing the model to fill it. I'd write something like:

{
  "invoice_number": "string",
  "date": "date (YYYY-MM-DD)",
  "total": "number"
}

But the model would interpret "date (YYYY-MM-DD)" as a literal string, or ignore the constraints when the source text was ambiguous. Validation was happening after the fact, and by then it was too late — the data was already corrupted.

What finally worked: function calling + runtime validation

The breakthrough came when I ditched the idea of getting perfect JSON from a single prompt. Instead, I used the LLM's ability to call functions, a feature OpenAI added to its chat models in 2023, and paired it with runtime schema validation.

Here's the basic idea:

Define a function with a strict schema (using something like Pydantic in Python).
Tell the model it can call this function to return its structured output.
After the model returns a function call argument (as JSON), validate that JSON against the schema.
If validation fails, feed the error back to the model and ask it to retry.

This turned a one-shot generation into a loop where the model could correct itself based on precise error messages.

Code example

I built this in Python using OpenAI's chat completions with function calling. Here's a simplified version (it assumes Pydantic v2 and the 1.x openai SDK):

import json
from pydantic import BaseModel, Field, ValidationError
from openai import OpenAI

client = OpenAI()

# Define the schema using Pydantic
class Invoice(BaseModel):
    invoice_number: str = Field(description="Unique invoice identifier")
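    # The pattern makes the YYYY-MM-DD requirement part of validation, not just a hint
    date: str = Field(description="Invoice date in YYYY-MM-DD format", pattern=r"^\d{4}-\d{2}-\d{2}$")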
    total: float = Field(description="Total amount due")
    line_items: list[dict] = Field(description="Array of {description, quantity, unit_price, total}")

# Convert to OpenAI function definition
invoice_schema = Invoice.model_json_schema()

functions = [
    {
        "name": "extract_invoice",
        "description": "Extract invoice details from text",
        "parameters": invoice_schema
    }
]

def extract_invoice_with_retry(text, max_retries=3):
    messages = [
        {"role": "system", "content": "You are an invoice extraction assistant. Return the data by calling the extract_invoice function."},
        {"role": "user", "content": text}
    ]
    for attempt in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            functions=functions,
            function_call={"name": "extract_invoice"}
        )
        msg = response.choices[0].message
        if msg.function_call:
            try:
                args = json.loads(msg.function_call.arguments)
                # Validate against Pydantic model
                invoice = Invoice(**args)
                return invoice.model_dump()
            except (json.JSONDecodeError, ValidationError) as e:
                # Feed error back to model
                messages.append(msg)
                messages.append({
                    "role": "function",
                    "name": "extract_invoice",
                    "content": f"Validation error: {e}. Please fix the output."
                })
                continue
    raise RuntimeError(f"Invoice extraction failed after {max_retries} attempts")

The key part is the except block. Instead of just logging the error, I send it back to the model as a synthetic function response, and the model adjusts its output on the next attempt. In my tests, this loop rarely needed more than one retry.
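The raw exception text can get long and noisy once a schema has more than a few fields. Here is a minimal sketch of condensing it before it goes back to the model. It assumes the error list has the shape that Pydantic v2's `ValidationError.errors()` returns (dicts with `loc` and `msg`); `format_validation_feedback` is a hypothetical helper, not part of the code above.

```python
def format_validation_feedback(errors: list[dict]) -> str:
    """Turn a list of validation errors into a short message for the model.

    Assumes each entry looks like an item from Pydantic's
    ValidationError.errors(): a dict with "loc" (a path tuple) and "msg".
    """
    lines = []
    for err in errors:
        # ("line_items", 0, "quantity") becomes "line_items.0.quantity"
        path = ".".join(str(part) for part in err.get("loc", ())) or "(root)"
        lines.append(f"- {path}: {err.get('msg', 'invalid value')}")
    header = "Validation failed. Fix these fields and call the function again:"
    return header + "\n" + "\n".join(lines)


# Hand-written sample in the assumed shape (not real model output)
sample_errors = [
    {"loc": ("total",), "msg": "Input should be a valid number"},
    {"loc": ("line_items", 0, "quantity"), "msg": "Field required"},
]
feedback = format_validation_feedback(sample_errors)
```

In the retry loop this would only apply to the ValidationError case, since json.JSONDecodeError has no errors() method and would still need its own message.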

Lessons learned

Validation must be part of the loop, not a post-processing step. Errors feed back to the model, making it self-correct.
Function calling constrains the output format far better than prompts do. The model is steered toward JSON that matches the schema defined in the function parameters, though a match isn't guaranteed, which is why the validation step stays in the loop.
Pydantic's schema generation is a lifesaver. It gives you a single source of truth for both runtime validation and the OpenAI function definition.
But it's not magic. Complex schemas (nested arrays, optional fields with defaults) can still confuse the model sometimes. Keep your schema as flat and simple as possible.

Trade-offs and when NOT to use this

This approach adds latency (one retry ≈ double the cost and time). If you need speed above all else — say, real-time chat — a simpler prompt-only approach might be better, even with occasional errors.
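The "double" figure applies to a request that actually retries. Averaged over all requests, the overhead depends on how often validation fails. Here is a back-of-envelope sketch, assuming every attempt fails independently with the same probability. That is a simplification, since the error feedback should make a retry likelier to succeed, and the 10% rate below is made up for illustration.

```python
def expected_calls(failure_rate: float, max_attempts: int) -> float:
    """Expected number of API calls per document.

    Assumes each attempt fails independently with probability
    `failure_rate`, and that we stop at the first success or after
    `max_attempts` attempts.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Attempt k+1 only happens if the first k attempts all failed.
    return sum(failure_rate ** k for k in range(max_attempts))


# Illustrative numbers only: 10% failure rate, at most 3 attempts
average = expected_calls(0.10, 3)  # 1 + 0.1 + 0.01
```

Under those assumptions the average cost goes up by about 11%, not 100%. The worst case for a single document is still three full calls.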

Also, function calling currently works best with GPT-4 and Claude 3. Cheaper models like GPT-3.5-turbo don't support it as reliably. If your budget is tight, you might need to fall back to prompt engineering.

Another limitation: the function calling API is still evolving. OpenAI has deprecated the "functions" and "function_call" parameters in favor of "tools" and "tool_choice", so the code above may need updates. Always check the latest docs.
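For reference, here is a rough sketch of the same loop written against `tools` and `tool_choice`. I haven't run this version in production, so treat it as a starting point. The names `build_tool` and `extract_with_tools` are my own, the client is passed in instead of created globally, and `validate` stands for any callable that raises ValueError on bad data (as far as I know, Pydantic's ValidationError is a ValueError subclass).

```python
import json


def build_tool(name: str, description: str, parameters: dict) -> dict:
    """Wrap a JSON Schema in the `tools` format that replaces `functions`."""
    return {
        "type": "function",
        "function": {"name": name, "description": description, "parameters": parameters},
    }


def extract_with_tools(client, text, parameters, validate, model="gpt-4", max_retries=3):
    """Validate-and-retry loop using `tools` / `tool_choice`.

    `client` is an OpenAI client; `parameters` is the JSON Schema;
    `validate` returns the cleaned data or raises ValueError.
    """
    name = "extract_invoice"
    messages = [
        {"role": "system", "content": "You are an invoice extraction assistant. "
                                      "Return the data by calling the extract_invoice function."},
        {"role": "user", "content": text},
    ]
    for _ in range(max_retries):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=[build_tool(name, "Extract invoice details from text", parameters)],
            # Force the model to call our function
            tool_choice={"type": "function", "function": {"name": name}},
        )
        call = response.choices[0].message.tool_calls[0]
        try:
            return validate(json.loads(call.function.arguments))
        except ValueError as e:  # json.JSONDecodeError is a ValueError too
            # Echo the assistant's tool call, then answer it with the error.
            messages.append({
                "role": "assistant",
                "content": None,
                "tool_calls": [{
                    "id": call.id,
                    "type": "function",
                    "function": {"name": name, "arguments": call.function.arguments},
                }],
            })
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,  # must match the call being answered
                "content": f"Validation error: {e}. Please fix the output.",
            })
    raise RuntimeError(f"Failed after {max_retries} attempts")
```

With the Invoice model from earlier, `parameters` could be `Invoice.model_json_schema()` and `validate` could be `lambda data: Invoice(**data).model_dump()`. The main difference from the old API is that the error goes back with role "tool" and the id of the call it answers.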

Finally, this approach assumes your schema can be expressed as JSON Schema. Some domain-specific constraints (e.g., "total must equal sum of line items") aren't easily enforced at the schema level. You'll need additional validation logic outside the LLM loop.
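As an example of that extra logic, here is a small sketch of the totals check. It assumes the dict shape produced by the Invoice model above, with each line item carrying its own "total", and it assumes the invoice total is a plain sum with no tax or discount lines. `check_totals` is a hypothetical name.

```python
from decimal import Decimal


def check_totals(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return a list of problems; an empty list means the totals agree."""
    problems = []
    # Go through str() so float noise (0.1 + 0.2) doesn't leak into Decimal
    line_sum = sum(
        (Decimal(str(item["total"])) for item in invoice["line_items"]),
        Decimal("0"),
    )
    stated = Decimal(str(invoice["total"]))
    if abs(line_sum - stated) > tolerance:
        problems.append(f"total is {stated} but line items sum to {line_sum}")
    return problems


# Hand-written sample data
good = {"total": 30.0, "line_items": [{"total": 10.0}, {"total": 20.0}]}
bad = {"total": 35.0, "line_items": [{"total": 10.0}, {"total": 20.0}]}
good_problems = check_totals(good)
bad_problems = check_totals(bad)
```

Returning a list instead of raising makes it easy to collect several business-rule failures and decide afterwards whether to reject the record or flag it for review.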

What I'd do differently next time

I'd start with function calling right away instead of spending days on prompt tweaks. I'd also plan for multiple schemas — some invoices have discounts or tax lines — and handle that by defining separate functions or using a union schema.
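One way the "separate functions" option could look: offer the model several function definitions, then pick the validator based on the name of the function it called. This is a sketch under assumptions, not something I've shipped. The function names and field sets are invented, and plain sets stand in for full Pydantic models to keep it dependency-free.

```python
import json

# Hypothetical required fields, one entry per invoice variant.
REQUIRED_FIELDS = {
    "extract_basic_invoice": {"invoice_number", "date", "total"},
    "extract_taxed_invoice": {"invoice_number", "date", "subtotal", "tax", "total"},
}


def validate_call(function_name: str, arguments_json: str) -> dict:
    """Validate a function call against the schema that matches its name."""
    if function_name not in REQUIRED_FIELDS:
        raise ValueError(f"unknown function: {function_name}")
    data = json.loads(arguments_json)
    if not isinstance(data, dict):
        raise ValueError("arguments must be a JSON object")
    missing = REQUIRED_FIELDS[function_name] - set(data)
    if missing:
        raise ValueError(f"missing fields for {function_name}: {sorted(missing)}")
    return data


# Hand-written sample arguments
basic = validate_call(
    "extract_basic_invoice",
    '{"invoice_number": "A1", "date": "2024-01-01", "total": 5}',
)
```

For this to work the request can't force a single function the way the example above does, which means the model may also answer without calling anything, and the loop has to handle that case.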

Also, I'd log every failed attempt and the correction message. That data is gold for improving the prompt or adjusting the schema.
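A minimal sketch of what that logging could look like, using only the standard library. The logger name and the record fields are my own choices; one JSON object per line keeps the log easy to grep and to load into a notebook later.

```python
import json
import logging

logger = logging.getLogger("invoice_extraction")


def failed_attempt_record(attempt: int, raw_arguments: str, error: Exception) -> str:
    """Build one JSON line describing a failed extraction attempt."""
    return json.dumps({
        "attempt": attempt,
        "raw_arguments": raw_arguments,      # exactly what the model returned
        "error_type": type(error).__name__,
        "error": str(error),                 # the text that gets fed back to the model
    })


def log_failed_attempt(attempt: int, raw_arguments: str, error: Exception) -> None:
    """Write the record to the log at WARNING level."""
    logger.warning(failed_attempt_record(attempt, raw_arguments, error))


# Hand-written sample of one failed attempt
record = failed_attempt_record(1, '{"total": "abc"}', ValueError("total must be a number"))
```

In the retry loop, `log_failed_attempt` would be called inside the except block, right before the error is appended to the messages.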

Closing thoughts

Reliable structured data from LLMs is possible, but it requires acknowledging that LLMs are imperfect. Instead of fighting that imperfection with clever prompts, I learned to work with it — by giving the model a way to correct itself. The function calling + validation loop isn't flashy, but it works.

Now I'm curious: how do you handle structured output from LLMs? Have you tried something similar, or do you have a completely different approach? Let me know in the comments — I'm always looking to improve this loop.