When Regex Isn't Enough: Extracting Structured Data with LLMs

#ai #python #tutorial #webdev

I hit a wall last month. A client wanted me to scrape product details from fifty different e-commerce sites — prices, sizes, colors, descriptions. I’ve done scraping before, so I started confidently: BeautifulSoup, CSS selectors, regex patterns for price formats.

Two hours in, I was staring at a mess. One site had prices as $19.99, another as 19,99 €. Sizes were <select> dropdowns on one page, radio buttons on another, and free text on a third. Every site had its own quirks. By the end of the day, I had a fragile regex for each domain, and the client wanted to add ten more sites the following week. I knew this wouldn’t scale.

What I tried (and why it hurt)

First, I doubled down on patterns. I wrote more aggressive regex to capture £19.99 and 19.99 GBP and $19.99 – $29.99. It turned into a nest of alternations and named groups. I kept a mental list of edge cases I missed — like a price that included a discount badge inline (was $29.99, now $19.99).
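To make the problem concrete, here is a small sketch of that kind of pattern. It is a hypothetical reconstruction for illustration, not the regex from the project: one alternation per price layout, with named groups. It finds every format listed above, but on a discount badge it returns both prices and has no way to tell which one is current.

```python
import re
from typing import List

# Hypothetical pattern for illustration: one alternative per price layout.
PRICE_RE = re.compile(
    r"""
    (?P<symbol_first>[$£€]\s?\d+(?:[.,]\d{2})?)            # symbol before the amount
    | (?P<symbol_last>\d+(?:[.,]\d{2})?\s?[$£€])           # symbol after the amount
    | (?P<code_last>\d+(?:[.,]\d{2})?\s?(?:USD|EUR|GBP))   # ISO code after the amount
    """,
    re.VERBOSE,
)


def find_prices(text: str) -> List[str]:
    """Return every substring that looks like a price, in document order."""
    return [match.group(0) for match in PRICE_RE.finditer(text)]


print(find_prices("19,99 €"))                 # ['19,99 €']
print(find_prices("19.99 GBP"))               # ['19.99 GBP']
print(find_prices("was $29.99, now $19.99"))  # ['$29.99', '$19.99']
```

The last call is the edge case in question: the pattern matches correctly, yet deciding which match is the current price needs context that a regex does not have.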

Then I tried visual parsing: headless Chrome with Playwright, waiting for the DOM to stabilize, then walking the tree looking for text nodes that looked like prices. It was slow and still brittle. A site redesign would break everything.

I almost gave up and hired a data entry person. But I wanted a solution that could handle any site, even ones I hadn’t seen yet.

The approach that finally worked: structured LLM output

I’d been playing with GPT-4 for code generation, but I hadn’t considered using it for data extraction. The idea was simple: give the LLM the raw HTML (or a cleaned text version) and a schema of what I wanted, and ask it to output JSON.

The key was function calling (also called tool use). Instead of asking the model to write free-form JSON, I defined the exact fields I wanted, with types and descriptions, as the parameters of a function. The model then returns the function’s arguments as a JSON string, which I can parse and validate programmatically.

Here’s how I set it up in Python:

from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List, Optional
import json

client = OpenAI(api_key="sk-...")

# Define the schema
class ProductInfo(BaseModel):
    name: str = Field(description="Product full name")
    price: str = Field(description="Current price as displayed, e.g. '$19.99'")
    original_price: Optional[str] = Field(None, description="Original price if discounted")
    currency: str = Field("USD", description="Currency code (USD, EUR, GBP, etc.)")
    colors: List[str] = Field(default_factory=list, description="Available colors")
    sizes: List[str] = Field(default_factory=list, description="Available sizes or dimensions")
    description: str = Field(description="Short product description")

# Convert Pydantic model to OpenAI function definition
extraction_function = {
    "name": "extract_product",
    "description": "Extract structured product info from HTML",
    "parameters": ProductInfo.model_json_schema()
}

def extract_from_html(html: str) -> ProductInfo:
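    # Strip most of the markup to focus on visible text.
    # strip_html is a helper you supply yourself (it is not defined in this snippet);
    # in practice, you might use a text extraction library like trafilatura.
    plain_text = strip_html(html)[:3000]  # Keep the first 3000 characters to limit token usage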

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a data extraction assistant. Extract the requested product information from the provided text. If a field is not present, leave it null or empty."
            },
            {
                "role": "user",
                "content": f"Extract product info from this text:\n\n{plain_text}"
            }
        ],
        tools=[{"type": "function", "function": extraction_function}],
        tool_choice={"type": "function", "function": {"name": "extract_product"}}
    )

    tool_call = response.choices[0].message.tool_calls[0]
    arguments = json.loads(tool_call.function.arguments)
    return ProductInfo.model_validate(arguments)
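
The extract_from_html function calls strip_html without showing it. A minimal sketch using only the standard library's html.parser might look like the following. The class name and the set of skipped tags are assumptions for illustration, and a dedicated library such as trafilatura will usually do a better job of isolating the main content.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of script/style tags."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse internal whitespace so the output is one clean line.
        text = " ".join(data.split())
        if self._skip_depth == 0 and text:
            self._chunks.append(text)

    def text(self) -> str:
        return " ".join(self._chunks)


def strip_html(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because the output is plain text, the [:3000] slice in extract_from_html cuts on characters, so long pages may lose details that appear late in the document.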

I tested it on a page from a random shoe store. The first call returned name="Running Sneaker X", price="$89.99", colors=["Black","White"]. Perfect. I tried it on three more sites — different markup, different wording — and it extracted correctly every time.

The reality check: it's not magic

This approach saved my project, but it has real trade-offs:

Cost: Every extraction is billed by the token, and the page text, the schema, and the output all count on each call. For a few hundred pages, it’s cheap (maybe a few cents). Scraping millions of products? The bill adds up quickly.
Latency: Each call takes 1–3 seconds. You can batch or parallelize, but you’re still slower than a regex-based solution that runs in microseconds.
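Hallucinations: The model might invent a price if it’s not visible on the page. I mitigated this with strict schema validation: if a required field is missing or has garbage, I re-run with a more specific prompt. Validation only catches missing or malformed values, though, so a plausible-looking invented price still needs a separate check against the page text.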
Prompt sensitivity: If you don’t give clear instructions, the model might include extra text like “The price is…” instead of “$19.99”. You need to iterate on your system prompt.
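
One cheap guard against the hallucination problem is to check that each extracted value actually appears in the text the model was given. The sketch below is an assumption about how such a check could look, not part of the original pipeline; the function names are hypothetical, and the comparison is a coarse substring match after removing whitespace and case.

```python
import re
from typing import List


def _normalize(text: str) -> str:
    """Lowercase and drop all whitespace so '$ 19.99' matches '$19.99'."""
    return re.sub(r"\s+", "", text).lower()


def ungrounded_fields(extracted: dict, source_text: str, fields: List[str]) -> List[str]:
    """Return the names of fields whose value does not occur in source_text."""
    haystack = _normalize(source_text)
    missing = []
    for field in fields:
        value = extracted.get(field)
        if value is None or value == "":
            continue  # Absent values are a schema-validation concern, not this check's.
        if _normalize(str(value)) not in haystack:
            missing.append(field)
    return missing


page = "Running Sneaker X. Now $89.99 (was $ 109.99). Colors: Black, White"
result = {"name": "Running Sneaker X", "price": "$89.99", "original_price": "$119.99"}
print(ungrounded_fields(result, page, ["name", "price", "original_price"]))
# ['original_price']
```

With the Pydantic model from earlier, the dict could come from product.model_dump(). A substring match can still pass a wrong value that happens to appear elsewhere on the page, so treat it as a filter for obvious inventions rather than proof of correctness.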

Despite these, for dynamic, heterogeneous content, it’s the best tool I’ve found. I ended up wrapping the logic into a small internal service — essentially the same pattern as something like ai.interwestinfo.com’s extraction API, but with my own schema definitions and retry logic.

Lessons learned

Don’t build regex for every site if you can use AI. It’s faster to set up and more resilient to layout changes.
Always validate the output. Pydantic’s model_validate catches type errors and missing fields immediately.
Clean the HTML first. Full HTML wastes tokens. Use a readability library to get the main content block.
Keep a retry loop with feedback. If validation fails, send the error message back to the model and ask it to correct the output.
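
The retry-with-feedback idea can be sketched without tying it to any one API. In the version below, call_model and validate are hypothetical callables you would supply: the first wraps the LLM request and returns the raw JSON string, the second raises ValueError on bad data (a thin wrapper around ProductInfo.model_validate would be a natural fit). It assumes plain chat messages; with forced tool calls you would pass tool_call.function.arguments as the raw string and adapt the message shapes to what the API expects.

```python
import json
from typing import Callable, Dict, List

Message = Dict[str, str]


def extract_with_retry(
    call_model: Callable[[List[Message]], str],
    validate: Callable[[dict], dict],
    messages: List[Message],
    max_attempts: int = 3,
) -> dict:
    """Call the model, validate its JSON, and feed errors back on failure."""
    history = list(messages)  # Copy so the caller's list is not mutated.
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(history)
        try:
            return validate(json.loads(raw))
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass.
            last_error = exc
            # Show the model its own output plus the error so it can correct it.
            history.append({"role": "assistant", "content": raw})
            history.append({
                "role": "user",
                "content": f"That output failed validation:\n{exc}\nReturn corrected JSON only.",
            })
    raise RuntimeError(f"Extraction failed after {max_attempts} attempts: {last_error}")
```

Capping the attempts matters for the cost trade-off: every retry resends the whole conversation, so a page that never validates should fail fast and go to a manual review queue instead.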

When NOT to do this

If you’re scraping a single source with a stable structure, just write a targeted CSS selector. If you have millions of pages and a strict budget, invest in training a smaller model or use traditional parsing. LLMs are for the messy, unstructured edge cases.
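
A quick back-of-the-envelope calculation makes the "millions of pages" case concrete. The function below is simple arithmetic; the token counts and per-million-token rates in the example are placeholders, not real prices, so plug in your provider's current numbers.

```python
def estimate_cost_usd(
    pages: int,
    input_tokens_per_page: int,
    output_tokens_per_page: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate the total API cost for extracting from a number of pages."""
    input_cost = pages * input_tokens_per_page / 1_000_000 * usd_per_million_input
    output_cost = pages * output_tokens_per_page / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


# Placeholder rates (NOT real pricing): $1.00 per million input tokens,
# $2.00 per million output tokens, 1000 tokens in and 150 out per page.
print(estimate_cost_usd(300, 1000, 150, 1.00, 2.00))        # a few hundred pages
print(estimate_cost_usd(1_000_000, 1000, 150, 1.00, 2.00))  # a million pages
```

Retries and longer pages push the real figure up, so it is worth measuring actual token usage on a sample before committing to a large crawl.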

What I’d do differently next time

I’d start with a few representative pages and build a test suite. I’d also benchmark gpt-4o-mini against other cheap hosted models and even local ones (e.g., Llama 3 with tool calling) to reduce latency and cost. And I’d move the schema definitions into a config file so non-developers can tweak them without touching code.
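
As a rough idea of what the config-file approach could look like, the sketch below builds a function definition from a JSON document instead of a Pydantic class. The config layout (function_name, fields, required, items) is a hypothetical format invented for this example; only the output shape, a name, a description, and a JSON Schema under parameters, mirrors the function definition used earlier.

```python
import json

# Hypothetical config format; in practice this would live in its own file.
CONFIG_JSON = """
{
  "function_name": "extract_product",
  "description": "Extract structured product info from HTML",
  "fields": [
    {"name": "name", "type": "string", "description": "Product full name", "required": true},
    {"name": "price", "type": "string", "description": "Current price as displayed", "required": true},
    {"name": "colors", "type": "array", "items": "string", "description": "Available colors"}
  ]
}
"""


def build_function_definition(config: dict) -> dict:
    """Turn the config dict into a function definition with a JSON Schema."""
    properties = {}
    required = []
    for field in config["fields"]:
        prop = {"type": field["type"], "description": field["description"]}
        if field["type"] == "array":
            # Array fields need an item type; default to strings.
            prop["items"] = {"type": field.get("items", "string")}
        properties[field["name"]] = prop
        if field.get("required", False):
            required.append(field["name"])
    return {
        "name": config["function_name"],
        "description": config["description"],
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


definition = build_function_definition(json.loads(CONFIG_JSON))
print(definition["parameters"]["required"])  # ['name', 'price']
```

The trade-off is that you lose Pydantic's typed model on the Python side, so the returned arguments would need to be validated against the same schema by other means.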

This method turned my nightmare project into a two-day task. The data is now flowing into the client’s database, and I sleep better knowing a site redesign won’t break everything.

What’s your experience with LLMs for data extraction? Have you had better luck with structured output or free-form prompts?