A few months ago, I was deep into a project that required extracting structured invoice data from messy PDFs. I had a pipeline: OCR the PDF, feed the text to GPT-4 with a detailed prompt, and get back a JSON object with fields like invoice_number, total_amount, line_items. Simple enough, right?
Wrong. The output was all over the place. Some responses had the right keys but swapped the values. Others returned markdown-wrapped JSON with extra text. A few were just plain invalid JSON. And even when the syntax was perfect, numbers were sometimes strings or fields were missing. I was spending more time debugging the LLM's output than actually building features.
What I tried (and why it didn't work)
First, I tried prompt engineering. I added explicit instructions:
"Return ONLY valid JSON. Use double quotes. The invoice_number must be a string. The total_amount must be a float."
I stuffed the prompt with few-shot examples, lowered the temperature to 0.1, even tried different models. Still, about 5–10% of responses were broken in some way. For a production system that processes thousands of invoices daily, that's a disaster.
Then I tried post-processing: regular expressions to pull the JSON out of any markdown wrapper, followed by json.loads() inside a try/except. That handled syntax errors, but not semantic ones (wrong types, missing fields). I could not rely on the LLM to follow the schema exactly every time.
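For illustration, that post-processing step might look roughly like the sketch below. The regex and the helper name `extract_json` are my own assumptions, not the exact code from the project; the point is only that this layer can recover syntax and nothing more.

```python
import json
import re
from typing import Any, Optional

# Matches a fenced block such as ```json ... ``` or ``` ... ``` and captures its body.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL | re.IGNORECASE)


def extract_json(raw: str) -> Optional[Any]:
    """Return the parsed JSON found in an LLM reply, or None if parsing fails."""
    match = FENCE_RE.search(raw)
    # Fall back to the whole reply when no fenced block is present.
    candidate = match.group(1) if match else raw
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None


reply = 'Here is the invoice:\n```json\n{"invoice_number": "A-17", "total_amount": "120.50"}\n```'
data = extract_json(reply)
print(data)
# The syntax is fine, but total_amount is still a string: a semantic error this step cannot see.
print(type(data["total_amount"]).__name__)
```

The reply parses cleanly, yet the amount comes back as a string, which is exactly the class of problem that needs schema validation rather than string cleanup.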
The approach that actually worked
I realized that instead of trying to make the LLM perfect on the first try, I should treat it as an iterative process. The idea: generate an initial output, validate it against a strict schema, and if it fails, feed the error back to the LLM and ask it to fix the specific issue. Repeat until valid or a max retry limit.
This is essentially a self-correcting pipeline – a common pattern in production AI systems. You don't need a special API; you just need a good validation library and a loop.
Here's what I built (simplified for this article):
import json
from pydantic import BaseModel, ValidationError
from typing import List, Optional
from openai import OpenAI

# Define your expected schema
class InvoiceItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    total_amount: float
    line_items: List[InvoiceItem]

client = OpenAI()

def generate_invoice_json(raw_text: str) -> Optional[Invoice]:
    prompt = f"""Extract invoice data from the following text. Return ONLY JSON.
Schema:
{Invoice.schema_json(indent=2)}
Text:
{raw_text}
"""
    max_retries = 3
    for attempt in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        raw = response.choices[0].message.content
        # Try to parse JSON
        try:
            # Strip markdown code fences if present
            raw_clean = raw.strip().removeprefix("```json").removesuffix("```").strip()
            data = json.loads(raw_clean)
            validated = Invoice(**data)
            return validated
        except (json.JSONDecodeError, ValidationError) as e:
            error_msg = str(e)
            print(f"Attempt {attempt+1} failed: {error_msg}")
            # Update prompt with error details
            prompt += f"\n\nPrevious attempt failed. Here is the error: {error_msg}\nPlease fix the JSON accordingly."
    # All retries exhausted
    return None
This simple loop catches both syntax errors and semantic violations (wrong types, missing fields) because Pydantic validates the nested structure. The error message from Pydantic is very descriptive: field 'line_items' -> 0 -> 'quantity': value is not a valid integer – exactly what the LLM needs to correct.
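If the raw str(e) turns out to be too verbose for the retry prompt, one option is to condense it first. The sketch below assumes the error details arrive as a list of dicts with "loc" and "msg" keys, which is the shape returned by Pydantic's ValidationError.errors(); the helper name `format_errors` and the sample data are hypothetical and hand-written, not captured from a real run.

```python
from typing import Any, Dict, List


def format_errors(errors: List[Dict[str, Any]]) -> str:
    """Turn error dicts shaped like ValidationError.errors() into one line per problem."""
    lines = []
    for err in errors:
        # "loc" is a tuple of keys and list indexes leading to the offending field.
        path = " -> ".join(str(part) for part in err["loc"])
        lines.append(f"{path}: {err['msg']}")
    return "\n".join(lines)


# Hand-written sample in the shape Pydantic reports (an assumption for this demo).
sample_errors = [
    {"loc": ("line_items", 0, "quantity"), "msg": "value is not a valid integer", "type": "type_error.integer"},
    {"loc": ("date",), "msg": "field required", "type": "value_error.missing"},
]

feedback = format_errors(sample_errors)
print(feedback)
```

In the loop above, this would mean building the feedback from e.errors() in the ValidationError case and keeping str(e) for plain JSON syntax errors.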
Where this falls short
While it works surprisingly well, there are trade-offs:
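- Cost: Each retry means another API call. If 10% of invoices need a retry, you're paying at least 10% more, and slightly more in practice because the retry prompt also carries the error message. For high-volume use, this adds up.
- Latency: Each retry adds a full round trip. One retry roughly doubles the response time, and using all three attempts roughly triples it. If you need real-time results, consider limiting retries or using a faster (cheaper) model for the correction step.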
- Endless loops: Sometimes the LLM keeps making the same mistake or introduces new ones. I cap retries at 3, but even then a small percentage will still fail. For those, I fall back to a manual review queue.
- Model dependence: GPT-4 is good at following correction instructions; weaker models might not improve the output. You may need to use a different model for the correction call.
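To put rough numbers on the cost and failure trade-offs above, here is a back-of-the-envelope sketch. It assumes every attempt fails independently with the same probability, which is optimistic given that the model sometimes repeats its own mistake, and it counts calls rather than tokens. The function names are mine.

```python
def expected_calls(failure_rate: float, max_attempts: int) -> float:
    """Expected API calls per invoice when each attempt fails independently."""
    # Attempt k (0-based) only happens if all k earlier attempts failed.
    return sum(failure_rate ** k for k in range(max_attempts))


def residual_failure_rate(failure_rate: float, max_attempts: int) -> float:
    """Share of invoices still invalid after every attempt is used up."""
    return failure_rate ** max_attempts


calls = expected_calls(0.10, 3)
leftover = residual_failure_rate(0.10, 3)
print(f"expected calls per invoice: {calls:.2f}")
print(f"share sent to manual review: {leftover:.3%}")
```

With a 10% failure rate and three attempts, that works out to about 1.11 calls per invoice and roughly one invoice in a thousand landing in the manual review queue under these assumptions.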
Making it more robust
I later refined the pipeline:
- Use structured output (function calling) when available – it gives you directly parseable JSON and reduces syntax issues.
- For correction, I use a separate, cheaper model (e.g., GPT-3.5) just to fix the JSON parse errors, and keep the main model for extraction. Cuts costs.
- Cache the correction results for identical errors during a batch run.
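The caching idea from the last bullet can be sketched as a small wrapper around whatever function performs the correction call. This is a simplified illustration: `make_cached_fixer` and `fake_fix` are hypothetical names, and keying the cache on both the broken output and the error message is my assumption about what "identical" should mean.

```python
from typing import Callable, Dict, List, Tuple


def make_cached_fixer(fix_fn: Callable[[str, str], str]) -> Callable[[str, str], str]:
    """Wrap fix_fn so identical (broken_output, error) pairs are corrected only once."""
    cache: Dict[Tuple[str, str], str] = {}

    def cached_fix(broken_output: str, error_msg: str) -> str:
        key = (broken_output, error_msg)
        if key not in cache:
            cache[key] = fix_fn(broken_output, error_msg)
        return cache[key]

    return cached_fix


calls: List[str] = []


def fake_fix(broken_output: str, error_msg: str) -> str:
    """Stand-in for the model call; it only strips a trailing comma."""
    calls.append(error_msg)
    return broken_output.replace(",}", "}")


fix = make_cached_fixer(fake_fix)
first = fix('{"a": 1,}', "trailing comma")
second = fix('{"a": 1,}', "trailing comma")
print(first, len(calls))
```

The second lookup is served from the dictionary, so the stand-in model function runs only once. The cache lives in memory for the duration of one batch run, matching the bullet above.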
Some managed AI services (like the one at InterwestInfo) offer built-in validation and retry logic out of the box, which is nice if you don't want to build it yourself. But the technique is universal – you can implement it with any LLM API and any validation library.
When NOT to use this approach
- If your schema is extremely simple (just a single string), retries are overkill.
- If you're running offline batch processing with long deadlines, maybe just increase temperature and regenerate multiple times until one passes.
- If latency is critical (sub-second responses), you're better off investing in better prompt engineering or fine-tuning.
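The "regenerate until one passes" alternative for batch jobs can be sketched without any error feedback at all. In this sketch `generate` and `validate` are injected callables (assumptions for illustration): the first would wrap the model call at a higher temperature, the second is expected to raise ValueError on bad output.

```python
import json
from typing import Any, Callable, Optional


def first_valid(
    generate: Callable[[], str],
    validate: Callable[[str], Any],
    max_samples: int = 5,
) -> Optional[Any]:
    """Draw up to max_samples independent outputs and return the first that validates."""
    for _ in range(max_samples):
        candidate = generate()
        try:
            return validate(candidate)
        except ValueError:
            # json.JSONDecodeError is a ValueError subclass, so syntax errors land here too.
            continue
    return None


# Canned replies stand in for sampled model outputs.
canned = iter(["not json", '{"total_amount": "oops"}', '{"total_amount": 42.0}'])


def fake_generate() -> str:
    return next(canned)


def validate_total(text: str) -> dict:
    data = json.loads(text)
    if not isinstance(data.get("total_amount"), (int, float)):
        raise ValueError("total_amount must be a number")
    return data


result = first_valid(fake_generate, validate_total)
print(result)
```

Unlike the correction loop, each sample here is independent, so the prompt never grows, but nothing steers the model away from a mistake it keeps making.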
What I'd do differently next time
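I'd start with function calling from day one. OpenAI's function calling returns the arguments as a JSON string shaped by the schema you declare, which removes most of the markdown-wrapping and stray-text problems, although the string can still occasionally be invalid JSON. Then I'd still validate with Pydantic, mainly for semantic correctness, with less retrying needed.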
Also, I'd log every attempt and error to a database. That data is gold for improving prompts or fine-tuning the model later.
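A minimal version of that logging could use SQLite from the standard library. The table layout, column names, and helper names below are assumptions for illustration, not a schema I actually ran in production.

```python
import sqlite3
from typing import Optional


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the attempt log; the default keeps it in memory for the demo."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS attempts (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               invoice_id TEXT NOT NULL,
               attempt INTEGER NOT NULL,
               raw_output TEXT NOT NULL,
               error TEXT,
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def log_attempt(
    conn: sqlite3.Connection,
    invoice_id: str,
    attempt: int,
    raw_output: str,
    error: Optional[str],
) -> None:
    """Record one model attempt; error is None when validation passed."""
    conn.execute(
        "INSERT INTO attempts (invoice_id, attempt, raw_output, error) VALUES (?, ?, ?, ?)",
        (invoice_id, attempt, raw_output, error),
    )
    conn.commit()


conn = open_log()
log_attempt(conn, "inv-001", 1, '{"total_amount": "twelve"}', "total_amount: value is not a valid float")
log_attempt(conn, "inv-001", 2, '{"total_amount": 12.0}', None)

# Group failures by message to see which mistakes are worth a prompt fix.
failures = conn.execute(
    "SELECT error, COUNT(*) FROM attempts WHERE error IS NOT NULL GROUP BY error"
).fetchall()
print(failures)
```

Calling log_attempt from both the success and failure branches of the retry loop would give a per-invoice history, and the grouped query shows which errors recur often enough to justify changing the prompt.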
Over to you
Building reliable LLM-powered pipelines is a craft. The self-correction pattern is one of many tools in the box. How do you handle inconsistent outputs in your projects? Do you use retries, fallback models, or some other trick?
I'd love to hear what's working (or not) for you.